Abstract We consider the setting of sequential prediction of arbitrary sequences
based on specialized experts. We first provide a review of the relevant literature and
present two theoretical contributions: a general analysis of the specialist aggregation
rule of Freund et al. [1997] and an adaptation of fixed-share rules of Herbster and
Warmuth [1998] in this setting. We then apply these rules to the sequential
short-term (one-day-ahead) forecasting of electricity consumption; to do so,
we consider two data sets, a Slovakian one and a French one, respectively con- 
cerned with hourly and half-hourly predictions. We follow a general methodology 
to perform the stated empirical studies and detail in particular tuning issues of the 
learning parameters. The introduced aggregation rules demonstrate an improved 
accuracy on the data sets at hand; the improvements lie in a reduced mean squared 
error but also in a more robust behavior with respect to large occasional errors. 
1 Introduction and motivation



We consider the sequential prediction of arbitrary sequences based on expert advice,
the topic of a large literature summarized in the monograph of Cesa-Bianchi
and Lugosi [2006]. At each round of a repeated game of prediction, experts output
forecasts, which are to be combined by an aggregation rule (usually based on their
past performance); the true outcome is then revealed and losses, which correspond 
to prediction errors, are suffered by the aggregation rules and the experts. We are 
interested in aggregation rules that perform almost as well as, for instance, the 
best constant convex combination of the experts. In our setting, these guarantees 
are not linked in any sense to a stochastic model: in fact, they hold for all sequences 
of consumptions, in a worst-case sense. 

The application we have in mind, the sequential short-term (one-day-ahead)
forecasting of electricity consumption, will take place in a variant of the basic
problem of prediction with expert advice called prediction with specialized (or 
sleeping) experts. At each round only some of the experts output a prediction 
while the other ones are inactive. This more difficult scenario does not arise from 
experts being lazy but rather from them being specialized. Indeed, each expert is 
expected to provide accurate forecasts mostly in given external conditions, that 
can be known beforehand. For instance, in the case of the prediction of electricity 
consumption, experts can be specialized to winter or to summer, to working days 
or to public holidays, etc. 

The literature on specialized experts is, to the best of our knowledge, rather
sparse. The first references are Blum [1997] and Freund et al. [1997]; they respectively
introduce and formalize the framework of specialized experts. They were
followed by only a few others: two papers mention some results for the context
of specialized experts only in passing ([Blum and Mansour, 2007, Sections 6-8]
and [Cesa-Bianchi and Lugosi, 2003, Section 6.2]), while another one, Kleinberg
et al. [2008], considers a somewhat different notion of regret.

The theory of prediction with expert advice has of course already been applied
to real data in many fields; we provide a list and a classification of such empirical
studies in Section 2.4. We only mention here that, as far as the forecasting of
electricity consumption is concerned, a preliminary study of some aggregation
rules for individual sequences was already performed for the daily prediction of
the French electricity load in Goude [2008a,b].



Contributions and outline of the paper 



We review in Section 2 the framework of sequential prediction with specialized
experts. Three families of aggregation rules are discussed; two of them were
obtained by taking a new look at existing strategies. This new look corresponds
to (slight or more substantial) adaptations of these existing strategies and
to simpler or more general analyses of their theoretical performance bounds. Finally,
a practical online tuning of these aggregation rules is developed and put in
perspective with respect to the theoretical methods available for this purpose.

We then study, respectively in Sections 4 and 5, the performance obtained
by the developed aggregation rules on two data sets. The first one was provided
by the Slovakian subbranch of EDF ("Électricité de France", a French electricity
provider) and represents its local market; the second one deals with the French
market, for which EDF is still the overwhelming provider. These empirical studies
are organized according to the same standardized methodology, described in
Section 3: construction of the experts based on historical data; tabulation of the
performance of some benchmark prediction methods; results obtained by the
sequential aggregation rules, first with parameters optimally tuned in hindsight, and
then when the tuning is performed sequentially according to the introduced online
tuning. The section on French data is also followed by a note (Section 5.6) on the
individual performance of the aggregation rules, i.e., an indication that their
behavior is not only good on average but also that large prediction errors occur
less frequently for the aggregation rules than for the base experts.



2 Aggregation of specialized experts: A survey with some new results 



The following framework was introduced in Blum [1997] and further studied in Freund
et al. [1997].



A bounded sequence of observations (e.g., hourly or half-hourly electricity consumptions)
$y_1, y_2, \ldots, y_T \in [0, B]$ is to be predicted element by element at time
instances $t = 1, 2, \ldots, T$. A finite number $N$ of base forecasting methods, henceforth
referred to as experts, are available. Before each time instance $t$, some experts
provide a forecast and the other ones do not. The former are said to be active and
their forecasts are denoted by $f_{j,t} \in \mathbb{R}_+$, where $j$ is the index of the considered
active expert; the experts of the second group are said to be inactive. We assume that
the experts know the bound $B$ and only produce forecasts $f_{j,t} \in [0, B]$. Finally, we
denote by $E_t \subset \{1, \ldots, N\}$ the set of active experts at a given time instance $t$ and
assume that it is always non-empty.

At each time instance $t \geq 1$, a sequential convex aggregation rule produces a
convex weight vector $\mathbf{p}_t = (p_{1,t}, \ldots, p_{N,t})$ based on the past observations $y_1, \ldots, y_{t-1}$
and the past and present forecasts $f_{j,s}$, for all $s = 1, \ldots, t$ and $j \in E_s$. By convex
weight vector, we mean a vector $\mathbf{p}_t \in \mathbb{R}^N$ such that $p_{j,t} \geq 0$ for all $j = 1, \ldots, N$
and $p_{1,t} + \cdots + p_{N,t} = 1$; we denote by $\mathcal{X}$ the set of all these convex weight vectors
over $N$ elements. The final prediction at $t$ is then obtained by linearly combining
the predictions of the experts in $E_t$ according to the weights given by the components
of the vector $\mathbf{p}_t$. More precisely, the aggregated prediction at time instance
$t$ equals

$$\widehat{y}_t = \frac{1}{\sum_{j' \in E_t} p_{j',t}} \sum_{j \in E_t} p_{j,t} \, f_{j,t} .$$

The observation $y_t$ is then revealed and instance $t + 1$ starts.

To measure the accuracy of the prediction $\widehat{y}_t$ proposed at round $t$ for the
observation $y_t$ we consider a loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. At each time instance $t$,
the convex combination $\mathbf{p}_t$ output by the rule is thus evaluated by the loss function
$\ell_t : \mathcal{X} \to \mathbb{R}$ defined by

$$\ell_t(\mathbf{p}) = \ell\Biggl( \frac{1}{\sum_{j' \in E_t} p_{j'}} \sum_{j \in E_t} p_j \, f_{j,t}, \; y_t \Biggr)$$

for all $\mathbf{p} \in \mathcal{X}$. The subscript $t$ in the notation $\ell_t$ encompasses the dependencies in
the expert forecasts $f_{j,t}$ and in the outcome $y_t$. Our goal is to design sequential
convex aggregation rules $\mathcal{A}$ with a small cumulative error $\sum_{t=1}^{T} \ell_t(\mathbf{p}_t)$. To do so,
we will ensure that quantities called regrets (with respect to fixed experts, to fixed
convex combinations of experts, or to sequences of experts with few shifts) are
small.

Possible loss functions are the square loss, defined by $\ell(x, y) = (x - y)^2$ for
all $x, y \in [0, B]$, the absolute loss $\ell(x, y) = |x - y|$, and the absolute percentage of
error $\ell(x, y) = |x - y| / y$, which are all three convex and bounded (so that their
associated loss functions $\ell_t$ are convex and bounded as well).
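
As an illustration of the definitions above, the following Python sketch evaluates the three loss functions and the loss of a convex weight vector when only some experts are active. It assumes that the aggregated prediction renormalizes the weights over the active experts; the function names and the dictionary representation of the active forecasts are illustrative choices, not notation from the paper.

```python
from typing import Callable, Dict, Sequence


def square_loss(x: float, y: float) -> float:
    return (x - y) ** 2


def absolute_loss(x: float, y: float) -> float:
    return abs(x - y)


def absolute_percentage_error(x: float, y: float) -> float:
    # Only meaningful when the observation y is positive.
    return abs(x - y) / y


def aggregate(weights: Sequence[float], forecasts: Dict[int, float]) -> float:
    """Convex combination of the forecasts of the active experts.

    `forecasts` maps the index of each active expert to its forecast
    (inactive experts are simply absent). The weights are renormalized
    over the active experts, which is an assumption of this sketch.
    """
    active_mass = sum(weights[j] for j in forecasts)
    if active_mass <= 0:
        raise ValueError("the active experts must carry positive weight")
    return sum(weights[j] * f for j, f in forecasts.items()) / active_mass


def loss_of_weights(loss: Callable[[float, float], float],
                    weights: Sequence[float],
                    forecasts: Dict[int, float],
                    observation: float) -> float:
    """Loss of a weight vector, i.e., loss of its aggregated forecast."""
    return loss(aggregate(weights, forecasts), observation)
```

For instance, with weights `[0.5, 0.25, 0.25]` and only experts 0 and 2 active, the weights actually used are proportional to 0.5 and 0.25.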



2.1 Minimizing regret with respect to fixed experts 



This notion of regret was introduced in Freund et al. [1997] and compares the error
suffered by a rule $\mathcal{A}$ to the one of a given expert $j$ only on time instances when
$j$ was active; formally, the regret of $\mathcal{A}$ with respect to expert $j$ up to instance $T$
equals

$$R_T(\mathcal{A}, j) = \sum_{t=1}^{T} \bigl( \ell_t(\mathbf{p}_t) - \ell_t(\delta_j) \bigr) \, \mathbb{1}_{\{j \in E_t\}} , \qquad (1)$$

where $\delta_j \in \mathcal{X}$ is the Dirac mass on $j$ (the convex weight vector with weight 1 on
$j$).

The exponentially weighted average aggregation rule 

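It relies on a parameter $\eta > 0$ and will thus be denoted by $\mathcal{E}_\eta$. It chooses $\mathbf{p}_1$ to
be the uniform distribution over $E_1$ and uses at time instance $t \geq 2$ the convex
weight vector $\mathbf{p}_t$ given by

$$p_{j,t} = \frac{e^{\eta R_{t-1}(\mathcal{E}_\eta, j)} \, \mathbb{1}_{\{j \in E_t\}}}{\sum_{k \in E_t} e^{\eta R_{t-1}(\mathcal{E}_\eta, k)}} ; \qquad (2)$$

that is, it only puts mass on the experts $j$ active at round $t$ and does so by
performing an exponentially weighted average of their past performance, measured
by the regrets $R_{t-1}(\mathcal{E}_\eta, j)$.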
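
A minimal Python sketch of this rule is given below. It assumes, as the description above indicates, that the weights are proportional to the exponential of $\eta$ times the past regret on the active experts and are zero elsewhere. The function names and data layout are hypothetical; subtracting the largest active regret before exponentiating is a numerical safeguard added here and does not change the resulting weights.

```python
import math
from typing import Dict, List, Sequence, Set


def ewa_weights(regrets: Sequence[float], active: Set[int], eta: float) -> List[float]:
    """Weights proportional to exp(eta * regret) on the active experts."""
    if not active:
        raise ValueError("at least one expert must be active")
    # Shift by the largest active regret to avoid overflow in exp.
    shift = max(regrets[j] for j in active)
    raw = [math.exp(eta * (regrets[j] - shift)) if j in active else 0.0
           for j in range(len(regrets))]
    total = sum(raw)
    return [r / total for r in raw]


def update_regrets(regrets: Sequence[float], active: Set[int],
                   rule_loss: float, expert_losses: Dict[int, float]) -> List[float]:
    """Add (loss of the rule - loss of expert j) for the active experts only."""
    return [r + (rule_loss - expert_losses[j]) if j in active else r
            for j, r in enumerate(regrets)]
```

With all regrets equal to zero (the situation at the first instance), `ewa_weights` returns the uniform distribution over the active experts.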

The following performance bound is a straightforward consequence of the results
presented in Cesa-Bianchi and Lugosi [2003] (its Corollary 2 and the methodology
followed in its Sections 3 and 6.2).

Theorem 1 We assume that the loss functions $\ell_t$ are convex and uniformly bounded;
we denote by $L$ a uniform bound on the quantities $|\ell_t(\delta_i) - \ell_t(\delta_j)|$ when $i$ and $j$ vary
in $E_t$ and $t$ varies from 1 to $T$. The regret of $\mathcal{E}_\eta$ is bounded over all such sequences of
expert forecasts and observations as

$$\max_{j = 1, \ldots, N} R_T(\mathcal{E}_\eta, j) \leq \frac{\ln N}{\eta} + \frac{\eta}{2} L^2 T . \qquad (3)$$

The (theoretically) optimal choice $\eta^\star = \sqrt{(2 \ln N)/(L^2 T)}$ leads to the uniform
bound $L \sqrt{2 T \ln N}$ on the regret of $\mathcal{E}_{\eta^\star}$. This choice depends on the horizon $T$ and
on the bound $L$, which are not always known in advance; standard techniques,
like the doubling trick or time-varying learning rates $\eta_t$, can be used to cope with
these limitations as far as theoretical bounds are concerned, see Auer et al. [2002],
Cesa-Bianchi et al. [2007].
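
The trade-off behind this choice of the learning rate can be checked numerically. The sketch below, with hypothetical function names, evaluates the right-hand side of (3) and the tuned learning rate that balances its two terms.

```python
import math


def ewa_regret_bound(eta: float, n_experts: int, loss_range: float, horizon: int) -> float:
    """Right-hand side of the bound: ln(N)/eta + (eta/2) L^2 T."""
    return math.log(n_experts) / eta + eta * loss_range ** 2 * horizon / 2.0


def ewa_optimal_eta(n_experts: int, loss_range: float, horizon: int) -> float:
    """Learning rate minimizing the bound: sqrt(2 ln(N) / (L^2 T))."""
    return math.sqrt(2.0 * math.log(n_experts) / (loss_range ** 2 * horizon))
```

At the tuned rate the two terms of the bound are equal, and their sum is $L \sqrt{2 T \ln N}$; any other value of the learning rate gives a larger bound.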
Remark 1 A slightly different family of aggregation rules based on exponentially
weighted averages, referred to as $\mathcal{H}$ in the sequel (which stands for Hedge), was
presented in [Blum and Mansour, 2007, Section 6]. It replaces the update (2) by
another update of the weights $p_{j,t}$ [display equation not recoverable from the source],
where the learning rates $\eta_j$ now depend on the experts $j = 1, \ldots, N$. By carefully
setting these rates, uniform regret bounds of the form $R_T(\mathcal{H}, j) = O(L \ldots)$ [remainder
of the bound not recoverable from the source] can be obtained. However, we checked
in [Devaine et al., 2009, Section 2.1] that the empirical performances of the families
of rules $\mathcal{H}$ and $\mathcal{E}$ were equal. This is why only the simpler of the two, $\mathcal{E}$, will be
considered in the sequel.



The specialist aggregation rule 

The content of this section revisits and (together with the gradient trick recalled
in the next section) improves on the results of [Freund et al., 1997, Sections 3.2-3.4].
In the latter reference, aggregation rules designed to minimize the regret were
introduced but their statements, analyses, and regret bounds heavily depended on
the specific¹ loss functions at hand. Two special cases were worked out (absolute
loss and square loss). In contrast, we provide a compact and general analysis, solely
based on Hoeffding's lemma.

The specialist aggregation rule is described in Figure 1; it relies on a parameter
$\eta > 0$ and will be denoted by $\mathcal{S}_\eta$. It is close in spirit to but different from the rule
$\mathcal{E}_\eta$: as we will see below, the two rules have comparable theoretical guarantees,
their statements might be found to exhibit some similarity as well, but we noted
that in practice the output convex weight vectors $\mathbf{p}_t$ had little in common (even
though the achieved performance was often similar).

Theorem 2 We assume that the loss functions $\ell_t$ are convex and uniformly bounded;
we denote by $L$ a constant such that the quantities $\ell_t(\delta_i)$ all belong to $[0, L]$ when
$i$ varies in $E_t$ and $t$ varies from 1 to $T$. The regret of $\mathcal{S}_\eta$ is bounded over all such
sequences of expert forecasts and observations as

$$\max_{j = 1, \ldots, N} R_T(\mathcal{S}_\eta, j) \leq \frac{\ln N}{\eta} + \frac{\eta}{8} L^2 T .$$

The proof of this theorem is postponed to the appendix (Section A). The
(theoretically) optimal choice $\eta^\star = \sqrt{(8 \ln N)/(L^2 T)}$ leads to the uniform bound
$L \sqrt{(T/2) \ln N}$ on the regret of $\mathcal{S}_{\eta^\star}$. The same comments on the calibration of $\eta$ as
in the previous section apply.



¹ See equation (6) in Freund et al. [1997] and the comments after its statement: "Here, a
and b are positive constants which depend on the specific on-line learning problem [...]."
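Parameters: learning rate $\eta > 0$

Initialization: $\mathbf{w}_1$ is the uniform convex weight vector, $w_{i,1} = 1/N$ for $i = 1, \ldots, N$

For each time instance $t = 1, 2, \ldots, T$,

(1) predict $\widehat{y}_t = \frac{1}{\sum_{j' \in E_t} w_{j',t}} \sum_{j \in E_t} w_{j,t} \, f_{j,t}$;

(2) observe $y_t$ and compute the convex weight vector $\mathbf{w}_{t+1}$ as

$$w_{i,t+1} = \begin{cases} w_{i,t} \, e^{-\eta \ell_t(\delta_i)} \, \dfrac{\sum_{k \in E_t} w_{k,t}}{\sum_{k' \in E_t} w_{k',t} \, e^{-\eta \ell_t(\delta_{k'})}} & \text{if } i \in E_t, \\[2ex] w_{i,t} & \text{if } i \notin E_t. \end{cases}$$

Fig. 1 The specialist aggregation rule $\mathcal{S}_\eta$.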
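
One step of the weight update of the specialist rule can be sketched in Python as follows. The sketch assumes the update in the form where active experts are reweighted by the exponential of minus $\eta$ times their loss and then rescaled so that the total mass of the active experts is unchanged, while inactive experts keep their weights; the function name and the data layout are illustrative.

```python
import math
from typing import Dict, List, Sequence, Set


def specialist_update(weights: Sequence[float], active: Set[int],
                      expert_losses: Dict[int, float], eta: float) -> List[float]:
    """One weight update of a specialist-type rule.

    `expert_losses` maps each active expert to its loss at the current
    instance. Inactive experts are left untouched.
    """
    active_mass = sum(weights[j] for j in active)
    scaled = {j: weights[j] * math.exp(-eta * expert_losses[j]) for j in active}
    scaled_mass = sum(scaled.values())
    new_weights = list(weights)
    for j in active:
        # Rescale so that the active experts keep the same total mass.
        new_weights[j] = scaled[j] * active_mass / scaled_mass
    return new_weights
```

Because the mass of the active set is preserved, the updated vector is again a convex weight vector, and the weights of sleeping experts are only affected through later renormalization at prediction time.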



2.2 Minimizing regret with respect to fixed convex combinations of experts 



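This notion of regret was introduced in Freund et al. [1997] as well and compares
the error suffered by a rule $\mathcal{A}$ to the one of a given convex combination $\mathbf{q} \in \mathcal{X}$ as
follows. Formally, for a set $E \subset \{1, \ldots, N\}$ of active experts, we define

$$q(E) = \sum_{j \in E} q_j$$

and denote by $\mathbf{q}^E = (q^E_1, \ldots, q^E_N)$ the convex weight vector obtained by "conditioning"
$\mathbf{q}$ to $E$:

$$\mathbf{q}^E = \begin{cases} (0, \ldots, 0) & \text{if } q(E) = 0, \\[1ex] \Bigl( \dfrac{q_1 \mathbb{1}_{\{1 \in E\}}}{q(E)}, \ldots, \dfrac{q_N \mathbb{1}_{\{N \in E\}}}{q(E)} \Bigr) & \text{if } q(E) > 0. \end{cases}$$

Now, the definition (1) can be generalized as

$$R_T(\mathcal{A}, \mathbf{q}) = \sum_{t=1}^{T} \Bigl( \ell_t(\mathbf{p}_t) - \ell_t\bigl(\mathbf{q}^{E_t}\bigr) \Bigr) \, q(E_t) . \qquad (4)$$

This is indeed a generalization as we have $R_T(\mathcal{A}, \delta_j) = R_T(\mathcal{A}, j)$.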

We deal with this more ambitious goal by resorting to the so-called gradient
trick, see [Cesa-Bianchi and Lugosi, 2006, Section 2.5] for more details. When the
loss function $\ell : [0, B]^2 \to \mathbb{R}$ is convex and (sub)differentiable in its first argument,
then the functions $\ell_t$ are convex and (sub)differentiable over $\mathcal{X}$; we denote by $\nabla \ell_t$
their (sub)gradient function. By denoting by $\cdot$ the inner product in $\mathbb{R}^N$ and viewing
$\mathcal{X}$ as a subset of $\mathbb{R}^N$, we have the following inequality: for all $t$, for all $\mathbf{q} \in \mathcal{X}$,

$$\ell_t(\mathbf{p}_t) - \ell_t(\mathbf{q}) \leq \nabla \ell_t(\mathbf{p}_t) \cdot (\mathbf{p}_t - \mathbf{q}) = \widetilde{\ell}_t(\mathbf{p}_t) - \widetilde{\ell}_t(\mathbf{q}) ,$$

where we denoted by $\widetilde{\ell}_t(\mathbf{q}) = \nabla \ell_t(\mathbf{p}_t) \cdot \mathbf{q}$ the pseudo-loss function associated with
time instance $t$. It is linear over $\mathcal{X}$. Now, the gradient trick simply consists of
replacing the loss functions $\ell_t$ by the pseudo-loss functions $\widetilde{\ell}_t$ in the definitions of
the forecasters. In particular, this replacement in (2), where the loss functions are
hidden in the regret terms, respectively, in Figure 1 leads to an aggregation rule
denoted by $\mathcal{E}^{\mathrm{grad}}_\eta$, respectively, $\mathcal{S}^{\mathrm{grad}}_\eta$.
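
For the square loss, the pseudo-losses take a simple explicit form. The Python sketch below illustrates this under simplifying assumptions that are ours: all experts are active at the instance considered and the weights sum to one, so that the prediction is the plain weighted average of the forecasts and component $j$ of the gradient equals twice the prediction error times the forecast of expert $j$. Function names are hypothetical.

```python
from typing import List, Sequence


def predict(weights: Sequence[float], forecasts: Sequence[float]) -> float:
    """Weighted average of the forecasts (all experts assumed active)."""
    return sum(w * f for w, f in zip(weights, forecasts))


def square_loss_of_weights(weights: Sequence[float], forecasts: Sequence[float],
                           observation: float) -> float:
    return (predict(weights, forecasts) - observation) ** 2


def square_loss_gradient(weights: Sequence[float], forecasts: Sequence[float],
                         observation: float) -> List[float]:
    """Gradient of the square loss with respect to the weight vector."""
    error = predict(weights, forecasts) - observation
    return [2.0 * error * f for f in forecasts]


def pseudo_loss(gradient: Sequence[float], weights: Sequence[float]) -> float:
    """Linear pseudo-loss: inner product of the gradient with the weights."""
    return sum(g * w for g, w in zip(gradient, weights))
```

The pseudo-loss of expert $j$ is the $j$-th component of the gradient, which is what a gradient version of an aggregation rule would feed to its update in place of the true loss of that expert.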

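Now, the above convexity inequality and the linearity of the $\widetilde{\ell}_t$ imply that for
any rule $\mathcal{A}$,

$$\max_{\mathbf{q} \in \mathcal{X}} R_T(\mathcal{A}, \mathbf{q}) \leq \max_{\mathbf{q} \in \mathcal{X}} \sum_{t=1}^{T} \Bigl( \widetilde{\ell}_t(\mathbf{p}_t) - \widetilde{\ell}_t\bigl(\mathbf{q}^{E_t}\bigr) \Bigr) \, q(E_t) = \max_{\mathbf{q} \in \mathcal{X}} \sum_{j=1}^{N} q_j \sum_{t=1}^{T} \Bigl( \widetilde{\ell}_t(\mathbf{p}_t) - \widetilde{\ell}_t(\delta_j) \Bigr) \, \mathbb{1}_{\{j \in E_t\}} = \max_{j = 1, \ldots, N} \sum_{t=1}^{T} \Bigl( \widetilde{\ell}_t(\mathbf{p}_t) - \widetilde{\ell}_t(\delta_j) \Bigr) \, \mathbb{1}_{\{j \in E_t\}} ;$$

the following result is thus a corollary of Theorems 1 and 2.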

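Corollary 1 We assume that the loss functions $\ell_t$ are convex and (sub)differentiable
over $\mathcal{X}$, with (sub)gradient functions uniformly bounded in the supremum norm, as $t$
varies, by $G$. The regret of $\mathcal{E}^{\mathrm{grad}}_\eta$ is bounded over all such sequences of expert forecasts
and observations as

$$\max_{\mathbf{q} \in \mathcal{X}} R_T\bigl(\mathcal{E}^{\mathrm{grad}}_\eta, \mathbf{q}\bigr) \leq \frac{\ln N}{\eta} + 2 \eta \, G^2 T ,$$

while the one of $\mathcal{S}^{\mathrm{grad}}_\eta$ is also uniformly bounded as

$$\max_{\mathbf{q} \in \mathcal{X}} R_T\bigl(\mathcal{S}^{\mathrm{grad}}_\eta, \mathbf{q}\bigr) \leq \frac{\ln N}{\eta} + \frac{\eta}{2} G^2 T .$$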



2.3 Minimizing regret with respect to sequences of (convex combinations of) 
experts with few shifts 

This third and last definition of regret was introduced by Herbster and Warmuth
[1998] and compares the performance of a rule not to the performance of a fixed
expert or a fixed convex combination of the experts, but to sequences of experts
or of convex combinations of experts (abiding by the activeness constraints given
by the $E_t$). To the best of our knowledge, this approach of considering sequences
of experts had not been used before to deal with specialized experts.

Formally, we denote by $\mathcal{L}$ the set of all legal sequences of expert instances
$j_1^T = (j_1, \ldots, j_T)$, where legality means that for all time instances $t$, the considered
expert $j_t$ is active (i.e., is in $E_t$). We call compound experts the elements of $\mathcal{L}$.
Similarly, we denote by $\mathcal{C}$ the set of all legal sequences of convex weight vectors
$\mathbf{q}_1^T = (\mathbf{q}_1, \ldots, \mathbf{q}_T)$, where legality means that for all time instances $t$, the considered
convex weight vector $\mathbf{q}_t$ puts positive masses only on elements in $E_t$. We call
compound convex weight vectors the elements of $\mathcal{C}$.

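For such compound experts $j_1^T$ or compound convex weight vectors $\mathbf{q}_1^T$, we
denote by

$$\mathrm{size}\bigl(j_1^T\bigr) = \sum_{t=2}^{T} \mathbb{1}_{\{j_{t-1} \neq j_t\}} \qquad \text{and} \qquad \mathrm{size}\bigl(\mathbf{q}_1^T\bigr) = \sum_{t=2}^{T} \mathbb{1}_{\{\mathbf{q}_{t-1} \neq \mathbf{q}_t\}}$$

their numbers of switches (the number minus one of elements in the partition of
$\{1, \ldots, T\}$ into integer subintervals corresponding to the use of the same expert or
convex weight vector). For $0 \leq m \leq T - 1$, we then respectively define $\mathcal{L}_m$ and
$\mathcal{C}_m$ as the subsets of $\mathcal{L}$ and of $\mathcal{C}$ containing the compound experts and compound
convex weight vectors with at most $m$ shifts. When $m$ is too small, the subsets $\mathcal{L}_m$
and $\mathcal{C}_m$ might be empty.

Parameters: learning rate $\eta > 0$ and mixing rate $0 \leq \alpha \leq 1$

Initialization: $(w_{1,0}, \ldots, w_{N,0}) = \frac{1}{|E_1|} \bigl( \mathbb{1}_{\{1 \in E_1\}}, \ldots, \mathbb{1}_{\{N \in E_1\}} \bigr)$

For each round $t = 1, 2, \ldots, T$,

(1) predict $\widehat{y}_t = \frac{1}{\sum_{j' \in E_t} w_{j',t-1}} \sum_{j \in E_t} w_{j,t-1} \, f_{j,t}$;

(2) [loss update] observe $y_t$ and define for each $i = 1, \ldots, N$,

$$v_{i,t} = \begin{cases} \dfrac{w_{i,t-1} \, e^{-\eta \ell_t(\delta_i)}}{\sum_{j \in E_t} w_{j,t-1} \, e^{-\eta \ell_t(\delta_j)}} & \text{if } i \in E_t, \\[2ex] \text{undefined} & \text{if } i \notin E_t; \end{cases}$$

(3) [share update] let $w_{j,t} = 0$ if $j \notin E_{t+1}$ and

$$w_{j,t} = \frac{1}{|E_{t+1}|} \sum_{i \in E_t \setminus E_{t+1}} v_{i,t} + \frac{\alpha}{|E_{t+1}|} \sum_{i \in E_t \cap E_{t+1}} v_{i,t} + (1 - \alpha) \, \mathbb{1}_{\{j \in E_t \cap E_{t+1}\}} \, v_{j,t}$$

if $j \in E_{t+1}$, with the convention that an empty sum is null and denoting by $|E_{t+1}|$ the
cardinality of $E_{t+1}$.

Fig. 2 The fixed-share aggregation rule $\mathcal{F}_{\eta,\alpha}$.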
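
The loss and share updates of a fixed-share rule for specialized experts can be sketched in Python as below. This is only an illustration under our reading of Figure 2: the weight of the experts that leave the active set is spread uniformly over the next active set, a fraction $\alpha$ of the weight of the experts that stay active is shared uniformly as well, and each staying expert keeps a fraction $1 - \alpha$ of its own weight. The function name and the data layout are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def fixed_share_step(weights: Sequence[float], active_now: Set[int],
                     active_next: Set[int], expert_losses: Dict[int, float],
                     eta: float, alpha: float) -> List[float]:
    """One round of loss update followed by share update.

    `weights` must put positive total mass on `active_now`;
    `expert_losses` maps each currently active expert to its loss.
    """
    # Loss update: exponential reweighting, normalized over the active experts.
    scaled = {i: weights[i] * math.exp(-eta * expert_losses[i]) for i in active_now}
    total = sum(scaled.values())
    v = {i: s / total for i, s in scaled.items()}

    # Share update: masses of the experts leaving / staying in the active set.
    leaving = sum(v[i] for i in active_now - active_next)
    staying = sum(v[i] for i in active_now & active_next)
    n_next = len(active_next)

    new_weights = [0.0] * len(weights)
    for j in active_next:
        own = v[j] if j in active_now else 0.0
        new_weights[j] = leaving / n_next + alpha * staying / n_next + (1.0 - alpha) * own
    return new_weights
```

Under these assumptions the returned weights sum to one and are supported on the next active set, so they can be used directly for the next prediction.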

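The regrets of a rule $\mathcal{A}$ with respect to $j_1^T \in \mathcal{L}$ and $\mathbf{q}_1^T \in \mathcal{C}$ are respectively
given by

$$R_T\bigl(\mathcal{A}, j_1^T\bigr) = \sum_{t=1}^{T} \bigl( \ell_t(\mathbf{p}_t) - \ell_t(\delta_{j_t}) \bigr) \qquad \text{and} \qquad R_T\bigl(\mathcal{A}, \mathbf{q}_1^T\bigr) = \sum_{t=1}^{T} \bigl( \ell_t(\mathbf{p}_t) - \ell_t(\mathbf{q}_t) \bigr) .$$

Since $\mathcal{L}_m \subset \mathcal{C}_m$ (up to the identification of expert indexes $j$ to convex weight
vectors $\delta_j$), it is more difficult to control the regret with respect to all elements of
$\mathcal{C}_m$ than the one with respect to simply $\mathcal{L}_m$.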

The aggregation rule presented in Figure 2 (when used directly on the losses
$\ell_t$) is actually nothing but an efficient computation of the rule that would consider
all compound experts and perform exponentially weighted averages on them in
the spirit of the rule $\mathcal{E}_\eta$ but with a non-uniform prior distribution. We will call it
the fixed-share rule for specialized experts; we denote it by $\mathcal{F}_{\eta,\alpha}$ as it depends on
two parameters, $\eta > 0$ and $0 \leq \alpha \leq 1$. This rule is a straightforward adaptation to
the setting of specialized experts of the original fixed-share forecaster of Herbster
and Warmuth [1998], see also [Cesa-Bianchi and Lugosi, 2006, Section 5.2].

Its performance bound is stated below; it follows from a straightforward but
lengthy adaptation of the techniques used in Herbster and Warmuth [1998] and
[Cesa-Bianchi and Lugosi, 2006, Section 5.2]. We provide the proof in the appendix of this
paper (Section B), for the sake of completeness and to show how the share update
of Figure 2 was obtained.
Theorem 3 We assume that the loss functions $\ell_t$ are convex and uniformly bounded;
we denote by $L$ a constant such that the quantities $\ell_t(\delta_i)$ all belong to $[0, L]$ when $i$
varies in $E_t$ and $t$ varies from 1 to $T$. For all $m \in \{0, \ldots, T - 1\}$, the regret of $\mathcal{F}_{\eta,\alpha}$
is uniformly bounded over all such sequences of expert forecasts and observations as

$$\max_{j_1^T \in \mathcal{L}_m} R_T\bigl(\mathcal{F}_{\eta,\alpha}, j_1^T\bigr) \leq \frac{m+1}{\eta} \ln N + \frac{1}{\eta} \ln \frac{1}{\alpha^m (1 - \alpha)^{T-m-1}} + \frac{\eta}{8} L^2 T . \qquad (5)$$

The (theoretically almost) optimal bound in the theorem above can be obtained
by defining the binary entropy $H$ as $H(x) = -x \ln x - (1 - x) \ln(1 - x)$ for $x \in [0, 1]$,
by fixing a value of $m$, and by carefully choosing parameters $\alpha^\star$ and $\eta^\star$ depending
on $m$, $L$, and $T$:

$$\max_{j_1^T \in \mathcal{L}_m} R_T\bigl(\mathcal{F}_{\eta^\star,\alpha^\star}, j_1^T\bigr) \leq L \sqrt{\frac{T}{2} \Bigl[ (m+1) \ln N + (T-1) \, H\bigl(m/(T-1)\bigr) \Bigr]} ,$$

which is $o(T)$ as desired as soon as $m = o(T)$. Of course, the theoretical optimal
choices depend on $T$ and $m$, so that here also sequential adaptive choices are
necessary; see Section 2.4 for a discussion.

By resorting to the gradient trick defined in Section 2.2, i.e., by replacing the
losses $\ell_t$ in the loss update of Figure 2 by the pseudo-losses $\widetilde{\ell}_t$, one obtains a variant
of the previous forecaster, denoted by $\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}$. The following performance bound is
a corollary of Theorem 3; a formal proof is provided in the appendix (Section C).

Corollary 2 We assume that the loss functions $\ell_t$ are convex and (sub)differentiable
over $\mathcal{X}$, with (sub)gradient functions uniformly bounded in the supremum norm, as $t$
varies, by $G$. For all $m \in \{0, \ldots, T - 1\}$, the regret of $\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}$ is uniformly bounded over
all such sequences of observations and of expert forecasts as

$$\max_{\mathbf{q}_1^T \in \mathcal{C}_m} R_T\bigl(\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}, \mathbf{q}_1^T\bigr) \leq \frac{m+1}{\eta} \ln N + \frac{1}{\eta} \ln \frac{1}{\alpha^m (1 - \alpha)^{T-m-1}} + \frac{\eta}{2} G^2 T . \qquad (6)$$



2.4 Sequential automatic tuning of the parameters on data 

The aggregation rules discussed above are only semi-automatic strategies, as they
rely on fixed-in-advance parameters $\eta$ (and possibly $\alpha$) that are not tuned on
data. Fully sequential aggregation rules need to set these parameters online.
Theoretically almost optimal ways of doing so exist; for instance, Auer et al. [2002] and
Cesa-Bianchi et al. [2007] indicate ways to tune online the learning rates $\eta$ for
exponentially weighted average rules $\mathcal{E}$ and $\mathcal{E}^{\mathrm{grad}}$ so as to achieve almost the same
regret bounds as if the parameters $L$, $G$, and $T$ were known in advance. However,
the learning rates thus obtained usually perform poorly in practice; see Mallet
et al. [2009] for an illustration of this fact on different data sets. The same is
observed on our data sets (results not reported); this does not come as a surprise as
the theoretically optimal parameters $\eta^\star$ themselves perform poorly, see Remarks 2
and 3 in the empirical studies. Therefore, in spite of the existence of theoretically
satisfactory methods, other ones need to be designed based on more empirical
considerations.

We do so below but for the sake of completeness we discuss first the symmetric
case of the tuning of the parameter $\alpha$ of the fixed-share type rules. These rules
actually need two parameters to be tuned, $\eta$ and $\alpha$; the two tunings are equally important, as
is illustrated by the performance reported in Tables 4 and 10. The tuning of $\eta$ could
be done according to the same theoretical methods as mentioned above (e.g., Auer
et al. [2002], Cesa-Bianchi et al. [2007]) but the same issues of practical performance
arise. As for $\alpha$, it is possible in theory not to tune it but to aggregate instances
of the rule corresponding to different values of $\alpha$, where these values lie in a fine
enough grid; again, the rule performing this aggregation, e.g., an exponentially
weighted average rule, needs to be properly tuned as far as its learning rate $\eta'$
is concerned. Such a double-layer aggregation was proposed by Monteleoni and
Jaakkola [2003], see also de Rooij and van Erven [2009]. We implemented it on our
second data set and it turned out to have a performance similar to the empirical
method we detail now, as long as the learning rates $\eta$ and $\eta'$ were properly set
both in the base rules and in the second-layer aggregation, e.g., as follows.



An empirical online tuning of the parameters 

We describe the method in a general framework; it is due to Vivien Mallet and
was proposed in the technical report by Gerchinovitz et al. [2008] (but never
published elsewhere to the best of our knowledge). Let $\mathcal{A}_\lambda$ be a family of sequential
aggregation rules relying each on some parameter $\lambda$ (possibly vector-valued) taking
its values in some set $\Lambda$. Given the past observations and the past and present
forecasts of the experts, the rule indexed by $\lambda$ prescribes at time instance $t$ a convex
weight vector which we denote by $\mathbf{p}_t(\mathcal{A}_\lambda)$.

The weights used by the fully sequential aggregation rule based on the family
of rules $\mathcal{A}_\lambda$, where $\lambda \in \Lambda$, will be denoted by $\widehat{\mathbf{p}}_t$. We assume that the considered
family is such that $\mathbf{p}_1(\mathcal{A}_\lambda)$ is independent of $\lambda$, so that $\widehat{\mathbf{p}}_1$ equals this common
value. Then, at time instances $t \geq 2$,

$$\widehat{\mathbf{p}}_t = \mathbf{p}_t\bigl(\mathcal{A}_{\widehat{\lambda}_{t-1}}\bigr) \qquad \text{where} \qquad \widehat{\lambda}_{t-1} \in \mathop{\mathrm{argmin}}_{\lambda \in \Lambda} \sum_{s=1}^{t-1} \ell_s\bigl(\mathbf{p}_s(\mathcal{A}_\lambda)\bigr) ; \qquad (7)$$

that is, we consider, for the prediction of the next time instance, the aggregated 
forecast proposed by the best so far member of the family of aggregation rules. 
Because of this formulation, we will speak of a meta-rule in the sequel. We can 
however offer no theoretical guarantee for the performance of the meta-rule in 
terms of the performance of the underlying family. 

Computationally speaking, we need to run in parallel all the instances of $\mathcal{A}_\lambda$,
together with the meta-rule. This of course is impossible as soon as $\Lambda$ is not finite;
for the families considered above we had $\Lambda = (0, +\infty)$ and $\Lambda = (0, +\infty) \times [0, 1]$.
This is why, in practice, we only consider a finite grid $\widetilde{\Lambda}$ over $\Lambda$ and perform
the minimization of (7) only on the elements of $\widetilde{\Lambda}$ instead of performing it on
the whole set $\Lambda$. A final choice still seems to be left to the user, namely, how to
design this finite grid $\widetilde{\Lambda}$. For the first data set (in Section 4.3) we fix it somewhat
arbitrarily. Based on the observed behaviors, we then propose for the second data
set (in Section 5.5) a way to construct online the grid $\widetilde{\Lambda}$, finally leading to a fully
sequential meta-rule.
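
On a finite grid, the selection step of the meta-rule reduces to tracking the cumulative loss of each instance and following the current best one. The Python sketch below illustrates this; it assumes that the losses of all instances have been computed beforehand and stored in a hypothetical table `losses_per_rule`, whereas an actual implementation would run the instances online, in parallel.

```python
from typing import List, Sequence


def run_meta_rule(losses_per_rule: Sequence[Sequence[float]]) -> List[int]:
    """Index of the grid element followed at each time instance.

    `losses_per_rule[k][t]` is the loss suffered at instance t by the rule
    run with the k-th parameter of the grid. At each instance, the meta-rule
    follows the rule with the smallest cumulative loss so far (ties, and in
    particular the first instance, are resolved in favor of the smallest index).
    """
    n_rules = len(losses_per_rule)
    horizon = len(losses_per_rule[0])
    cumulative = [0.0] * n_rules
    choices = []
    for t in range(horizon):
        # Choice based on past losses only (instances before t).
        choices.append(min(range(n_rules), key=lambda k: cumulative[k]))
        for k in range(n_rules):
            cumulative[k] += losses_per_rule[k][t]
    return choices
```

The choice made at the first instance is immaterial under the assumption, stated above, that all instances agree on their first weight vector.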
Literature review of empirical studies in our framework



Several articles report applications of prediction based on expert advice to real 
data. They do not investigate the online tuning issues discussed above and can be 
clustered into three categories as far as the tuning of the parameters is concerned 
(there is often only a learning rate $\eta$ to be set).

The first group chooses in the experiments the theoretically optimal parameters
(sometimes, for instance, in the case of square losses, these are given by the
rates $\eta$ such that a property of exp-concavity holds). This would be possible as
well in our context with improved regret bounds but only for the basic versions of
our forecasters, not for their gradient versions (which will be seen to obtain a much
improved performance in practice). Furthermore, even such choices of $\eta$ are slightly
suboptimal on our data sets with respect to the fully sequential tuning described
above. Actually, tuning $\eta$ in such a way, one only targets the performance of the
best expert, not the one of the best convex combination of the experts (which is
significantly better). Examples of such articles and fields of application include the
management of the tradeoff between energy consumption and performance in wireless
networks (Monteleoni and Jaakkola [2003]), the tracking of climate models
(Monteleoni et al. [2011], Jacobs [2011]), the network traffic demand (Dashevskiy
and Luo [2011]), the prediction of GDP data (Jacobs [2011]), and also the online
aggregation of portfolios (e.g., Cover [1991], Stoltz and Lugosi [2005], but the
literature is vast). In particular, as far as the latter application is concerned, we
note that Borodin et al. [2000] indicates that the studied forecasters do not differ
significantly from uniform averages of the experts; this is because the parameter $\eta$
is not set large enough. This is why we designed a method to tune it automatically
based on past data to get the right scale of the problem.

The second group of articles only reports results of optimal-in-hindsight parameters
(and sometimes argues that the performance is not very sensitive to the
tuning, a fact that we do not observe on the data sets studied in this paper). The
corresponding applications are those of Mallet et al. [2009] and Mallet [2010], and
the prediction of outcomes of sports games (Dani et al. [2006]).



The third group reports the performance of various values of the parameters
without choosing between them in advance, for instance, Vovk and Zhdanov [2008]
for the latter application or Stoltz and Lugosi [2005], already mentioned above.
3 Methodology followed in the empirical studies

We provide a standardized outline of the treatment of the two data sets discussed 
in the next sections. 



Outline of the empirical studies of performance of the sequential aggregation rules 

1. Design experts (based on some historical data). 

2. Choose a loss function and evaluate the performance of the experts (on 
new data). 

3. For each family of strategies compute the performance corresponding to 
the best constant choices of the parameters in hindsight. 

4. Assess the quality of the operational performance, i.e., the performance
obtained after some automatic and sequential tuning (see Section 2.4).

5. Provide additional results and comments (e.g., a robustness study).



By evaluation of the performance of the experts we mean the assessment of 
the accuracy obtained by some simple strategies like the uniform average of the 
forecasts of the active experts (a strategy easily implementable online) or by some 
oracles, like the best single expert or the best constant convex combination of the 
experts. Finally, the so-called prescient strategy is the strategy that picks at each 
time instance the best forecast output by the set of experts; it indicates a bound 
on the performance that no aggregation strategy can improve on given the data 
set (given the expert forecasts and the observations). It corresponds to the best 
element in $\mathcal{L}_{T-1}$.
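
Two of these benchmarks are simple enough to be sketched in a few lines of Python: the uniform average of the forecasts of the active experts, and the prescient strategy that picks at each instance the forecast closest to the observation. The data layout (one dictionary of active forecasts per time instance) and the function names are illustrative assumptions; the error measure used here is the root mean square error.

```python
import math
from typing import Dict, List, Sequence


def rmse(predictions: Sequence[float], observations: Sequence[float]) -> float:
    """Root mean square error over all time instances."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, observations)]
    return math.sqrt(sum(squared) / len(squared))


def uniform_average_forecasts(forecasts: Sequence[Dict[int, float]]) -> List[float]:
    """Average of the forecasts of the active experts, instance by instance."""
    return [sum(f.values()) / len(f) for f in forecasts]


def prescient_forecasts(forecasts: Sequence[Dict[int, float]],
                        observations: Sequence[float]) -> List[float]:
    """Forecast of the active expert closest to the observation (an oracle)."""
    return [min(f.values(), key=lambda x: abs(x - y))
            for f, y in zip(forecasts, observations)]
```

The uniform average is implementable online, whereas the prescient strategy uses the observations and therefore only serves as a reference value.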



4 A first data set: Slovakian consumption data 

The data was provided by the Slovakian subbranch of the French electricity provider 
EDF. It is formed by the hourly predictions of 35 experts and the corresponding 
observations (formed by hourly mean consumptions) on the period from January 1, 
2005 to December 31, 2007. In this part and unlike for the French data set of the 
next part, we have absolutely no information on how the experts were built and 
we merely consider them as black boxes. 

As the behavior of electricity consumption depends heavily on the hour of the
day and the data set is large enough, we split it into 24 subsets (one per hour
interval of the day) and only report the results obtained for one-day-ahead
prediction on a given (somewhat arbitrarily chosen) hour interval: the interval
11:00-12:00. The characteristics of the observations $y_t$ of this hour frame are described
in Table 1, while all observations (for all hour frames) are plotted in Figure 3.

The considered loss function is the square loss and we will not report cumulative
losses but root mean square errors (rmse), i.e., roots of the per-round cumulative
losses. For instance, for a given convex combination $\mathbf{q} \in \mathcal{X}$,

$$\mathrm{rmse}(\mathbf{q}) = \sqrt{ \frac{1}{\sum_{t=1}^{T} q(E_t)} \sum_{t=1}^{T} q(E_t) \Biggl( \sum_{j \in E_t} q^{E_t}_j f_{j,t} - y_t \Biggr)^{2} } ,$$

while for an aggregation rule $\mathcal{A}$,

$$\mathrm{rmse}(\mathcal{A}) = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} \bigl( \widehat{y}_t - y_t \bigr)^2 } .$$

In this section, we will omit the unit MW (megawatt) of the observations and
predictions of the electricity consumption, as well as the one of their corresponding
rmse.



4.1 Benchmark values: performance of the experts and of some oracles 

The characteristics of the experts are depicted in Figure 4. The bar plot represents
the values of the rmse of the 35 available experts. The scatter plot relates the rmse
of each of the experts to its frequency of activity, that is, it plots the pairs, for all
experts $j$,

$$\Biggl( \mathrm{rmse}(j), \; \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}_{\{j \in E_t\}} \Biggr) . \qquad (8)$$

We present in Table 2 the values² of the rmse of several procedures, all of them
but the first two being oracles. The procedure $\mathcal{U}$ is an aggregation rule that simply
chooses at each time instance $t$ the uniform convex weight vector on $E_t$. Its rmse
differs from the one of the uniform convex weight vector $(1/35, \ldots, 1/35)$ as the
rmse of the latter gives a weight to each instance $t$ that depends on the cardinality
of $E_t$.

The fact that the rmse of the best compound expert with size at most 10 
is larger than the rmse of the best single expert is explained by the fact that 
some overall good experts refrain from predicting at some time instances when 
all active experts perform poorly, while compound experts are required to output 
a prediction at each time instance. The fact that such good experts tend not to 
form predictions at instances that are more difficult to cope with can also be seen 
from the fact that rmse(U) is larger than rmse((1/35, . . . , 1/35)), since the second
uniform average rule is evaluated with unequal weights put on the different time 
instances (more weight put on instances when more experts are active). 
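
The compound-expert oracles can be computed by dynamic programming. The sketch below is a minimal illustration under stated assumptions: it takes "size at most m" to mean at most m shifts between consecutive experts (the precise definition is given earlier in the paper and is not restated here), it requires the selected expert to be active at every round, and the function name and data layout are hypothetical.

```python
import math


def best_compound_expert_rmse(expert_forecasts, observations, max_shifts):
    """RMSE of the best sequence of active experts with at most max_shifts shifts.

    expert_forecasts[t] maps each expert active at round t to its forecast.
    cost[s][j] is the smallest cumulative square loss of a sequence that
    currently follows expert j and has used exactly s shifts so far.
    """
    inf = float("inf")
    cost = [dict() for _ in range(max_shifts + 1)]
    for t, (f_t, y_t) in enumerate(zip(expert_forecasts, observations)):
        new_cost = [dict() for _ in range(max_shifts + 1)]
        for s in range(max_shifts + 1):
            for j, forecast in f_t.items():
                if t == 0:
                    previous = 0.0 if s == 0 else inf
                else:
                    stay = cost[s].get(j, inf)
                    switch = (
                        min((c for i, c in cost[s - 1].items() if i != j), default=inf)
                        if s > 0
                        else inf
                    )
                    previous = min(stay, switch)
                if previous < inf:
                    new_cost[s][j] = previous + (forecast - y_t) ** 2
        cost = new_cost
    candidates = [c for level in cost for c in level.values()]
    if not candidates:
        raise ValueError("no admissible compound expert with that many shifts")
    return math.sqrt(min(candidates) / len(observations))
```

Because a sleeping expert cannot be followed, some shifts are forced; this is the mechanism by which a compound expert with few shifts can end up with a larger rmse than the best single expert, whose rmse only counts the rounds at which it is active.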

A final series of oracles is given by partitioning time into subsets of instances
with constant sets of active experts; that is, by defining

$$\bigl\{ E^{(1)}, \ldots, E^{(K)} \bigr\} = \bigl\{ E_t, \ t \in \{1, \ldots, T\} \bigr\}$$

and by partitioning time according to the values $E^{(k)}$ taken by the sets of active
experts $E_t$. The corresponding natural oracles are

$$\min \sqrt{\frac{1}{T} \sum_{k=1}^K \ \sum_{t : E_t = E^{(k)}} \bigl( f_{j_k,t} - y_t \bigr)^2}\,, \qquad \text{with } j_k \in E^{(k)} \text{ for all } k = 1, \ldots, K, \tag{9}$$


Footnote: All the values of Table 2 have been computed exactly, except the ones that
involve minimizations over simplexes of convex weights, for which a Monte-Carlo
stochastic approximation method was used.

Fig. 3 The observed hourly electricity consumptions encountered by the Slovakian subbranch
between January 1, 2005 and December 31, 2007. 

Table 1 Some characteristics of the observations $y_t$ (hourly mean consumptions) of the
Slovakian data set for the time interval 11:00-12:00.

Number of days D           1095
Time intervals             Only 11:00-12:00
Number of instances T      1095 (= 1095 × 1)
Number of experts N        35
Unit                       MW
Median of the $y_t$        702.6
Bound B on the $y_t$       1020.0


Fig. 4 Graphical representations of the performance of the experts of the Slovakian data set: 
sorted RMSE (left) and RMSE-frequency of activity pairs (right).

Table 2 Definition and performance of several (possibly off-line) benchmark procedures on
the Slovakian data set; they serve as comparison points for on-line procedures.

Name of the benchmark procedure                        Formula                                                                        Value
Uniform sequential aggregation rule                    rmse(U)                                                                        = 31.1
Uniform convex weight vector                           rmse((1/35, . . . , 1/35))                                                     = 30.7
Best single expert                                     $\min_{j = 1, \ldots, 35} \mathrm{rmse}(j)$                                    = 30.4
Best convex weight vector                              $\min_{q \in \mathcal{X}} \mathrm{rmse}(q)$                                    = 29.2

Best compound expert
  Size at most m = 10                                  $\min_{j_1^T \in \mathcal{C}_{10}} \mathrm{rmse}(j_1^T)$                       = 32.1
  Size at most m = 50                                  $\min_{j_1^T \in \mathcal{C}_{50}} \mathrm{rmse}(j_1^T)$                       = 23.1
  Size at most m = 200                                 $\min_{j_1^T \in \mathcal{C}_{200}} \mathrm{rmse}(j_1^T)$                      = 15.2
  Prescient strategy (size at most m = T - 1 = 1 094)  $\min_{j_1^T \in E_1 \times E_2 \times \cdots \times E_T} \mathrm{rmse}(j_1^T)$ = 9.4

On the K = 74 elements of a partition of time
according to the values of the active sets $E_t$
  Best expert on each element                          See (9)                                                                        = 29.1
  Best convex weight vector on each element            See (10)                                                                       = 24.5




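which corresponds to the choice of the best expert on each element of the partition,
and

$$\min \sqrt{\frac{1}{T} \sum_{k=1}^K \ \sum_{t : E_t = E^{(k)}} \Biggl( \sum_{j \in E^{(k)}} q^{(k)}_j f_{j,t} - y_t \Biggr)^{2}}\,, \qquad \text{with } q^{(k)} \text{ a convex weight vector on } E^{(k)} \text{ for all } k = 1, \ldots, K, \tag{10}$$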

which corresponds to the choice of the best convex weight vector on each element of 
the partition. Even if there are relatively many elements in this partition, namely, 
K = 74, the gain with respect to constant choices throughout time exists (rmse of 
29.1 versus 30.4 and 24.5 versus 29.2) but is less significant than the one achieved 
with compound experts (which achieve a smaller rmse of 23.1 already with a size 
m = 50). 
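
The oracle that picks the best expert on each element of the partition can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function name and the data layout (one dictionary of forecasts per round, keyed by active expert) are assumptions, and the normalization by T follows the convention used for aggregation rules.

```python
import math
from collections import defaultdict


def best_expert_per_active_set(expert_forecasts, active_sets, observations):
    """Best-expert oracle on the partition of time induced by the active sets.

    Returns the RMSE of the oracle and the number K of distinct active sets.
    Every active set is assumed to be nonempty.
    """
    rounds_by_set = defaultdict(list)
    for t, active in enumerate(active_sets):
        rounds_by_set[frozenset(active)].append(t)

    total = 0.0
    for active, rounds in rounds_by_set.items():
        # Smallest cumulative square loss among the experts active on this element.
        total += min(
            sum((expert_forecasts[t][j] - observations[t]) ** 2 for t in rounds)
            for j in active
        )
    return math.sqrt(total / len(observations)), len(rounds_by_set)
```

The version with the best convex weight vector on each element replaces the inner minimum over experts by a minimization over the simplex, which calls for a numerical solver or a stochastic approximation.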



4.2 Results obtained with constant values of the parameters 

We now detail the practical performance of the sequential aggregation rules
introduced in Section 2 for fixed values of the parameters $\eta$ and $\alpha$ of the rules. We
report for each rule the best performance obtained; the corresponding parameters
are called the best constant choices in hindsight. The performance of the families
$\mathcal{E}_\eta$, $\mathcal{E}^{\mathrm{grad}}_\eta$, and $\mathcal{S}^{\mathrm{grad}}_\eta$ is summarized in Table 3. We note that $\mathcal{E}^{\mathrm{grad}}_\eta$ and $\mathcal{S}^{\mathrm{grad}}_\eta$, when
tuned with the best parameter $\eta$ in hindsight, outperform their comparison oracle,
the best convex weight vector (with a relative improvement of 3 % in terms of the
rmse), while the performance of the best $\mathcal{E}_\eta$ comes very close to the one of its
respective comparison oracle, the best single expert (rmse of 30.5 versus 30.4). The
performance of the fixed-share type rules $\mathcal{F}_{\eta,\alpha}$ and $\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}$ is reported in Table 4.



Remark 2 As in Mallet et al. [2009], the best constant choices in hindsight are
far away from the theoretically optimal ones, given by 77* ^ 8 x 10 for 
7?* ^ 4 X 10"** for and 77* « 2 x 10"^ for f^'""*. 

We close this preliminary review of performance by showing in Figure 5 that
the considered rules fully exploit the whole set of experts and do not concentrate 
on a limited subset of the experts. They carefully adapt their convex weights as 
time evolves and remain reactive to changes of performance; in particular, the 
sequences of weights do not converge to a limit vector. 



4.3 Results obtained with an online tuning of the parameters 

We show in this section how the meta-rules constructed in Section 2.4 can get
performance close to the one of the rules based on the best constant parameters in
hindsight; we do so, for this data set only, by fixing the grids somewhat arbitrarily.
Based on the observed behaviors, we then indicate for the second data set
(in Section 5.5) how these grids can be constructed online. For the exponentially
weighted average rules $\mathcal{E}_\eta$ and $\mathcal{E}^{\mathrm{grad}}_\eta$, the order of magnitude of the optimal values
77* being around 10~^, we considered two finite grids for the tuning of r;, both with 
endpoints $10^{-8}$ and 1: a smaller grid, with 9 logarithmically evenly spaced points,

$$\Lambda_s = \bigl\{ 10^{-k}, \ \text{for } k \in \{0, 1, \ldots, 8\} \bigr\},$$

and a larger grid, with 25 logarithmically evenly spaced points,

$$\Lambda_\ell = \bigl\{ m \times 10^{-k}, \ \text{for } k \in \{1, \ldots, 8\} \text{ and } m \in \{1, 2.5, 5\} \bigr\} \cup \{1\}.$$

The performance on these grids with respect to the best constant choice of $\eta$ in
hindsight is summarized in Table 5. We note that the good performance obtained
for the best choices of the parameters in hindsight is preserved by the adaptive
meta-rules resorting to the grids. The sequences of choices of $\eta$ on the larger grid
$\Lambda_\ell$ are depicted in Figure 6.

For the fixed-share type rules $\mathcal{F}_{\eta,\alpha}$ and $\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}$, two parameters have to be tuned:
we need to take a finite grid in $\Lambda = (0, +\infty) \times [0, 1]$, e.g., similarly to above,

$$\Lambda_{\mathrm{FS}} = \bigl\{ (10^{-k}, \alpha), \ \text{for } k \in \{0, 1, \ldots, 8\} \text{ and } \alpha \in \{0.01, 0.05, 0.1, 0.2, 0.3, 0.4\} \bigr\}.$$

The performance on this grid is summarized in Table 6, while the sequences of
choices of $\eta$ and $\alpha$ on the grid $\Lambda_{\mathrm{FS}}$ are depicted in Figure 7. The same comments
as above on the preservation of the good performance apply.
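
The grids of this section are easy to generate programmatically. The following Python sketch is given for illustration only; the function names are hypothetical, and the grids are built as described in the text (9 points from $10^{-8}$ to 1, 25 points obtained with the multipliers 1, 2.5, and 5, and a product grid for the fixed-share rules).

```python
import math


def small_grid():
    """Nine points 10^-k for k = 0, ..., 8."""
    return sorted(10.0 ** (-k) for k in range(0, 9))


def large_grid():
    """Points m * 10^-k for k = 1, ..., 8 and m in {1, 2.5, 5}, together with 1."""
    points = {m * 10.0 ** (-k) for k in range(1, 9) for m in (1.0, 2.5, 5.0)}
    points.add(1.0)
    return sorted(points)


def fixed_share_grid():
    """Pairs (eta, alpha) with eta = 10^-k, k = 0, ..., 8, and six values of alpha."""
    alphas = (0.01, 0.05, 0.1, 0.2, 0.3, 0.4)
    return [(10.0 ** (-k), alpha) for k in range(0, 9) for alpha in alphas]


def log10_span(grid):
    """Base-10 logarithms of the two endpoints of a grid of positive values."""
    return math.log10(min(grid)), math.log10(max(grid))


print(len(small_grid()), len(large_grid()), len(fixed_share_grid()))
```

A meta-rule then only needs to run one instance of the base rule per grid point and to select among them sequentially, so the cost of tuning grows linearly with the size of the grid.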

Table 3 Performance obtained by the sequential aggregation rules $\mathcal{E}_\eta$, $\mathcal{E}^{\mathrm{grad}}_\eta$, and $\mathcal{S}^{\mathrm{grad}}_\eta$ for
various choices of $\eta$; the smallest RMSE obtained for each rule is marked with an asterisk.

Value of $\eta$                              $10^{-8}$   $10^{-7}$   $10^{-6}$   $4 \times 10^{-6}$   $10^{-5}$   $10^{-4}$   $10^{-3}$
RMSE of $\mathcal{E}_\eta$                   31.3        31.2        30.8        30.5*                30.9        32.7        -
RMSE of $\mathcal{E}^{\mathrm{grad}}_\eta$   -           31.3        30.9        -                    29.8        28.2*       33.5
RMSE of $\mathcal{S}^{\mathrm{grad}}_\eta$   -           31.3        30.9        -                    29.8        28.2*       34.7


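Table 4 Performance obtained by the sequential aggregation rules $\mathcal{F}_{\eta,\alpha}$ and $\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}$ for various
choices of $\eta$ and $\alpha$; the smallest rmse obtained for each rule is marked with an asterisk.

Value of $\eta$                                       $10^{-4}$   $10^{-4}$   $10^{-3}$   $10^{-3}$   $10^{-2}$   $10^{-2}$   |   $2 \times 10^{-4}$   $2 \times 10^{-3}$
Value of $\alpha$                                     0.05        0.2         0.1         0.2         0.05        0.2         |   0.07                 0.2
RMSE of $\mathcal{F}_{\eta,\alpha}$                   29.3        29.5        27.5        27.2        28.0        27.8        |   -                    27.0*
RMSE of $\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}$   28.0        28.9        29.3        29.2        28.7        28.5        |   27.2*                -
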
Table 5 Performance obtained by the rules $\mathcal{E}_\eta$ and $\mathcal{E}^{\mathrm{grad}}_\eta$ for the best constant choice of $\eta$ in
hindsight (left) and when used as keystones of a meta-rule selecting sequentially the values of
$\eta$ on the chosen grids (middle and right).

                                             Best constant $\eta$   Grid $\Lambda_s$   Grid $\Lambda_\ell$
RMSE of $\mathcal{E}_\eta$                   30.5                   31.1               30.7
RMSE of $\mathcal{E}^{\mathrm{grad}}_\eta$   28.2                   28.2               28.4


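Table 6 Performance obtained by the rules $\mathcal{F}_{\eta,\alpha}$ and $\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}$ for the best constant choices of
$\eta$ and $\alpha$ in hindsight (left) and when used as keystones of a meta-rule selecting sequentially
the values of $\eta$ and $\alpha$ on the grid $\Lambda_{\mathrm{FS}}$ (right).

                                                      Best constant pair $(\eta, \alpha)$   Grid $\Lambda_{\mathrm{FS}}$
RMSE of $\mathcal{F}_{\eta,\alpha}$                   27.0                                  27.8
RMSE of $\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}$   27.2                                  28.5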

Fig. 5 Graphical representations of the convex weights associated at each time instance with
the 35 experts by the rules $\mathcal{E}^{\mathrm{grad}}_{10^{-4}}$ (left) and $\mathcal{F}_{2 \times 10^{-3},\, 0.2}$ (right).

Fig. 6 Graphical representations of the sequences of tuning parameters $\eta$ chosen by the
meta-rule selecting sequentially the values on the grid $\Lambda_\ell$; the base rules are $\mathcal{E}^{\mathrm{grad}}_\eta$ (left) and
$\mathcal{E}_\eta$ (right).

Fig. 7 Graphical representations of the sequences of tuning parameters $\eta$ (left) and $\alpha$ (right)
chosen by the meta-rule selecting sequentially the values on the grid $\Lambda_{\mathrm{FS}}$; the base rule is
$\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}$.


5 A second data set: Operational forecasting on French data

The data set used in this part is a standard data set used by the EDF R&D
department. It contains the observed electricity consumptions as well as some side
information, which consists of all the features that were shown to have a strong
effect on electricity load; see, e.g., Bunn and Farmer [1985]. Among others, one
can cite seasonal effects (most importantly, the seasonal variations of day lengths),
calendar events like vacation periods or public holidays, weather conditions
(temperature, cloud cover, wind), and weekly patterns of days. We summarize below
some of its characteristics and we refer the interested reader to Dordonnat et al.
[2008] for a more detailed description.



It is divided into two sets. The first set ranges from September 1, 2002 to 
August 31, 2007. We call it the estimation set and use it to design the experts, 
which then provide forecasts throughout the period corresponding to the second 
set. This second set covers the period from September 1, 2007 to August 31, 2008. 
We call it the validation set and use it to evaluate the performance of the considered 
aggregation rules. Actually, we exclude some special days from the validation set. 
Out of the 366 days between September 1, 2007 and August 31, 2008, we keep 
320 days. The excluded days correspond to public holidays (the day itself, as 
well as the days before and after it), daylight saving days and winter holidays 
(that is, the period between December 21, 2007 and January 4, 2008); however, 
we include the summer break (August 2008) in our analysis as we have access to 
experts that are able to produce forecasts for this period. The characteristics of the 
observations yt of the validation set (formed by half-hourly mean consumptions) 
are described in Table 7. In this part as well, we omit the unit GW (gigawatt) of
the observations and predictions of the electricity consumption, as well as the one 
of their corresponding rmse. 

Note that this time we no longer split the data set into subsets by the
half-hours; this is explained in detail below and comes from two facts: the data set 
is smaller (and thus the data subsets would be too small) and we need to abide 
by an operational constraint as far as the forecasting in France is concerned. 



5.1 Brief description of the construction of the considered experts 

The experts we consider here come from three main categories of statistical models:
parametric, semi-parametric, and non-parametric models. We do so to get experts
that are heterogeneous and exhibit varied enough behaviors.

The parametric model used to generate the first group of 15 experts is described
in Bruhns et al. [2005] and is implemented in an EDF software called "Eventail."
(For conciseness we refer to them as the Eventail experts.) This model is based on
a nonlinear regression approach that consists of decomposing the electricity load
into a main component including all the seasonality effects of the process together
with a weather-dependent component. To this nonlinear regression model is added
an autoregressive correction of the error of the short-term forecasts of the last seven
days. Changing the parameters (the gradient of the temperature, the short-term
correction) of this model led to the indicated 15 experts.

The second group of 8 experts comes from a generalized additive model
presented in Pierrot et al. [2009], Pierrot and Goude [2011] and implemented in the






Table 7 Some characteristics of the observations $y_t$ (half-hourly mean consumptions) of the
French data set of operational forecasting.

Number of days D           320
Time intervals             Every 30 minutes
Time instances T           15 360 (= 320 × 48)
Number of experts N        24 (= 15 + 8 + 1)
Unit                       GW
Median of the $y_t$        56.33
Bound B on the $y_t$       92.76


Fig. 8 Graphical representations of the performance of the experts of the French data set:
sorted RMSE (left) and RMSE-frequency of activity pairs (right); Eventail experts are depicted
by the symbols •, GAM experts are represented by △, while * stands for the similarity expert.

Table 8 Definition and performance of several (possibly off-line) benchmark procedures on
the French data set; they serve as comparison points for on-line procedures.

Name of the benchmark procedure          Formula                                                                         Value
Uniform sequential aggregation rule      rmse(U)                                                                         = 0.724
Uniform convex weight vector             rmse((1/24, . . . , 1/24))                                                      = 0.748
Best single expert                       $\min_{j = 1, \ldots, 24} \mathrm{rmse}(j)$                                     = 0.782
Best convex weight vector                $\min_{q \in \mathcal{X}} \mathrm{rmse}(q)$                                     = 0.683

Best compound expert
  Size at most m = 50                    $\min_{j_1^T \in \mathcal{C}_{50}} \mathrm{rmse}(j_1^T)$                        = 0.534
  Size at most m = 100                   $\min_{j_1^T \in \mathcal{C}_{100}} \mathrm{rmse}(j_1^T)$                       = 0.474
  Size at most m = T - 1 = 15 359        $\min_{j_1^T \in E_1 \times E_2 \times \cdots \times E_T} \mathrm{rmse}(j_1^T)$ = 0.223





software R by the mgcv package developed by Wood [2006]. (We refer to them as the
GAM experts.) The considered generalized additive model imports the idea of the
parametric modeling presented above into a semi-parametric modeling. One of its
key advantages is its ability to adapt to changes in consumption habits, while
parametric models like Eventail need some a priori knowledge on customers' behaviors.
Here again, we derived the 8 GAM experts by changing the trend extrapolation
effect (which accounts for the yearly economic growth) or the short-term effects
like the one-day-lag effect; these changes affect the reactivity to changes along the
run.

The last expert is drastically different from the two previous groups of experts
as it relies on a univariate method (i.e., a method not requiring any exogenous
factor like weather conditions); this method is presented in Antoniadis et al.
[2006, 2010]. Its key idea is to assume that the load is driven by an underlying stochastic
curve and to view each day as a discrete recording of this functional process.
Forecasts are then performed according to a similarity measure between days. We
call this expert the similarity expert.



5.2 Benchmark values: performance of the experts and of some oracles 

The characteristics of the experts presented above are depicted in Figure 8, here
again with a bar plot representing the (sorted) values of the rmse of the 24 available
experts and a scatter plot relating the rmse of each of the experts to its frequency of
activity. Out of the 15 Eventail experts, 3 are active all the time; they correspond to
the operational model actually used at the R&D center of EDF and to two variants
of it based on different short-term corrections. The other 12 Eventail experts are
inactive during the summer as their predictions are redundant with the 3 main
Eventail experts (they were obtained by changing the gradient of the temperature
for the heating part of the load consumption, which generates differences to the
operational model in winter only). GAM experts are active on an overwhelming
fraction of the time and are sleeping only during periods when R&D practitioners
know beforehand that they will perform poorly (e.g., in time periods close to public
holidays); the lengths of these periods depend on the parameters of the expert.
Finally, the similarity expert is always active.

We report in Table 8 the performance obtained by most of the oracles already
discussed in Section 4.1. We do not report here the performance obtained by
considering partitions of the time in terms of the values of the active sets $E_t$: on the
one hand, the study of Section 4.1 showed that even when the number of elements
K in the partition was large, the compound experts had better performance, and
on the other hand, the value of K is small here (K = 7); these two facts explain
that the performance of the oracles based on partitions is expected to be poor
on this data set.

We note the disappointing performance of the best single expert with respect
to the naive rule U. Unlike in Section 4.1, this comes from our experts being
more active in challenging situations. Indeed, the rule U also performs better than
the uniform convex weight vector, which induces at each time instance the same
forecast as the rule U but for which the loss incurred at a given time instance
is given more weight when more experts are active. All in all, the poor performance of
the best single expert or of the uniform convex weight vector is caused by the
considered specialized experts being more active and more helpful when needed.

From Table 8 we mostly conclude the following. The true benchmark values
from the first part of the table are the rmse of the rule U (which all fancy rules
have to outperform to be considered worth the trouble) and the rmse of the best
convex weight vector. The second part of the table indicates that important gains
in accuracy are obtained with compound experts (and therefore, fixed-share type
rules are expected to perform well, which will turn out to be the case).



5.3 Extension of the considered rules to the operational forecasting constraint 

We consider prediction with an operational constraint required by EDF consisting
of producing half-hourly forecasts every day at 12:00 for the next 24 hours; that is,
of forecasting simultaneously the next 48 time instances. (The experts presented
above also abide by this constraint.) The high-level idea is to run the original rules
on the data (called below the base rules), access the proposed convex weight
vectors only at time instances of the form $t_k = 48k + 1$, and use these vectors for
the next 48 time instances, by adapting them via a renormalization or a share
update to the values of the active sets $E_{t_k+1}, \ldots, E_{t_k+48}$.
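
The high-level idea can be sketched in a few lines of Python. The sketch is hypothetical in several respects, all of which are assumptions made here: rounds are indexed from 1, losses become available in blocks of 48, the base rule is taken to be an exponentially weighted average fed with per-expert cumulative losses, and the convention used to accumulate the losses of sleeping experts is left to the caller. The helper names are not from the paper.

```python
import math

BLOCK = 48  # number of half-hourly instances forecast simultaneously


def last_sync(t, block=BLOCK):
    """Number of past rounds whose losses are usable at round t (t starts at 1)."""
    return block * ((t - 1) // block)


def frozen_weights(t, cumulative_losses, prior, eta, block=BLOCK):
    """Exponential weights based only on the losses known at the last synchronization.

    cumulative_losses[j][s] is the cumulative loss of expert j after s rounds
    (with cumulative_losses[j][0] == 0.0); prior[j] is its initial weight.
    """
    s = last_sync(t, block)
    return {j: prior[j] * math.exp(-eta * cumulative_losses[j][s]) for j in prior}


def renormalize(weights, active):
    """Restrict a weight vector to the active experts and rescale it to sum to 1."""
    mass = sum(weights[j] for j in active)
    if mass == 0.0:
        return {j: 1.0 / len(active) for j in active}
    return {j: weights[j] / mass for j in active}
```

Within a block, the unnormalized weights stay frozen and only the renormalization changes from one instance to the next, which is how the varying active sets are handled without any new loss information.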

We also propose another extension related to the structure of the set of experts.
The latter are of three different types and experts of the same type are obtained
as variants of a given prediction method (GAM, Eventail, or functional similarity
estimation). It would be fair to allocate an initial weight of 1/3 to the group of
GAM experts, which turns into an initial weight of 1/24 to each of the 8 GAM
experts; a weight of 1/3 to the group formed by the 15 Eventail experts, that is, an
initial weight of 1/45 to each of them; and an initial weight of 1/3 to the similarity
expert. We denote by $p_{j,0}$ the initial weight of an expert j. We will call fair initial
weights the convex weight vector described above (with components equal to 1/3,
1/24, or 1/45) and uniform initial weights the vector defined by $p_{j,0} = 1/24$ for all
experts j. The effect of this on the regret bounds, e.g., (3) or (5), is the replacement
of $\ln N$ by $\max_j \ln(1/p_{j,0})$. This does not change the order of magnitude in T of the
regret bounds but only increases them by a multiplicative factor.
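
The fair allocation and its effect on the constant of the bounds can be checked numerically. The short Python sketch below is illustrative; the function and variable names are hypothetical, and only the group sizes (8, 15, and 1) come from the text.

```python
import math
from fractions import Fraction


def fair_initial_weights(group_sizes):
    """Give each group the same total mass, split evenly among its members."""
    share = Fraction(1, len(group_sizes))
    return {name: share / size for name, size in group_sizes.items()}


groups = {"GAM": 8, "Eventail": 15, "similarity": 1}
per_expert = fair_initial_weights(groups)

# Total mass of the resulting weight vector (should be exactly 1).
total_mass = sum(per_expert[name] * size for name, size in groups.items())

# Constants appearing in the bounds: max_j ln(1 / p_{j,0}) versus ln N.
fair_constant = max(math.log(float(1 / w)) for w in per_expert.values())
uniform_constant = math.log(sum(groups.values()))
print(per_expert, total_mass, fair_constant, uniform_constant)
```

With these group sizes the constant moves from ln 24 (about 3.18) to ln 45 (about 3.81), the largest value being driven by the most populated group.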

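All in all, we denote by $\mathcal{W}_\eta$ and $\mathcal{W}^{\mathrm{grad}}_\eta$ the adaptations to the operational
constraint of the rules $\mathcal{E}_\eta$ and $\mathcal{E}^{\mathrm{grad}}_\eta$ of Sections 2.1 and 2.2, by $\mathcal{T}_\eta$ and $\mathcal{T}^{\mathrm{grad}}_\eta$ the
ones of the rules $\mathcal{S}_\eta$ and $\mathcal{S}^{\mathrm{grad}}_\eta$ of Sections 2.1 and 2.2, and by $\mathcal{G}_{\eta,\alpha}$ and $\mathcal{G}^{\mathrm{grad}}_{\eta,\alpha}$ the
ones of the rules $\mathcal{F}_{\eta,\alpha}$ and $\mathcal{F}^{\mathrm{grad}}_{\eta,\alpha}$ described in Section 2.3. For instance, $\mathcal{W}_\eta$ uses,
at time t = 1, 2, . . . , T, the weight vector defined by

for all experts j, with the usual convention that empty sums equal 0. (The notation
$\lfloor x \rfloor$ denotes the lower integer part of a real number x.)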

Similarly, as is illustrated in its statement in Figure 9, $\mathcal{G}_{\eta,\alpha}$ basically needs to
run an instance of $\mathcal{F}_{\eta,\alpha}$ and to access its proposed weight vector every 48 rounds.
Between two such synchronizations, only share updates (and no loss update) are
performed, to deal with the fact that experts are specialized. Indeed, the values of

Parameters: $\eta > 0$ and $0 \leq \alpha \leq 1$, as well as an initial convex weight vector $(p_{1,0}, \ldots, p_{N,0})$
Initialization: $(w_{1,0}, \ldots, w_{N,0}) = \bigl( p_{1,0} \mathbb{1}_{\{1 \in E_1\}}, \ldots, p_{N,0} \mathbb{1}_{\{N \in E_1\}} \bigr)$
For each round t = 1, 2, . . . , T,

(1) $\hat{y}_t = \frac{1}{\sum_{i \in E_t} w_{i,t-1}} \sum_{j \in E_t} w_{j,t-1} f_{j,t}$;

(2) [loss and share updates]
    if t = 48k for some k, observe $y_{t-47}, \ldots, y_t$ and take $(w_{1,t}, \ldots, w_{N,t}) = p_{t+1}(\mathcal{F}_{\eta,\alpha})$ (see note a);

(3) [share update]
    otherwise (when t is not a multiple of 48), let $w_{j,t} = 0$ if $j \notin E_{t+1}$ and

    if $j \in E_{t+1}$ (with the convention that an empty sum is null).

Note a: $p_{t+1}(\mathcal{F}_{\eta,\alpha})$ is the convex weight vector chosen by the rule $\mathcal{F}_{\eta,\alpha}$ after seeing the sequence
of observations $y_1, \ldots, y_t$ and the corresponding expert predictions; we use here the same
notation as in Section 2.4, where we indicated in parentheses the name of the rule whenever
it was needed. Here, the rule $\mathcal{G}_{\eta,\alpha}$ thus synchronizes again with $\mathcal{F}_{\eta,\alpha}$ at steps t of the form
$t_k = 48k$ for some k.

Fig. 9 The extension $\mathcal{G}_{\eta,\alpha}$ of the (basic) fixed-share aggregation rule $\mathcal{F}_{\eta,\alpha}$ to operational
forecasting.



the sets of active experts $E_t$ may (and do) vary within a one-day-ahead period of
time.

Theoretical bounds on the regret can be proved since, as is clear from the
algorithmic statements of the extensions, the weights output by the base rules are,
for all t, close to the ones of their adaptations (and of course, coincide with them
at the time instances $t_k$). This is because these weights are computed on almost
the same sets of losses; these sets differ by at most 47 losses, the ones between the
last $t_k$ and the current instance t. A quantification of this fact and a sketch of a
regret bound, e.g., for $\mathcal{W}_\eta$, are provided in the appendix.



5.4 Results obtained with constant values of the parameters 

The performance of the extensions $\mathcal{W}_\eta$, $\mathcal{W}^{\mathrm{grad}}_\eta$, $\mathcal{T}_\eta$, and $\mathcal{T}^{\mathrm{grad}}_\eta$ described above is
summarized in Table 9. We note that the gradient versions of the forecasters
(for both priors) outperform the comparison point formed by the rmse of the best
convex weight vector, equal to 0.696, which was the only interesting benchmark
value among the oracles of the first part of Table 8. They do so by a relative factor
of about 5%; on the other hand, their basic versions (in case of a fair prior) get
only a slightly improved performance with respect to this comparison point. It is
also worth noting that the performance of the gradient versions is not sensitive to
the initial allocation of weights.

Remark 3 Here again, as already mentioned for the Slovakian data set in
Section 4.2, the best constant choices in hindsight are far away from the theoretically
optimal ones, given by values $\eta^\star$ of the order of $10^{-6}$ on the present data set. For
such small values of $\eta$, the rules are basically equivalent to the uniform aggregation
rule U, as is indicated by the performance reported in Table 9.

The performance of the extensions $\mathcal{G}_{\eta,\alpha}$ and $\mathcal{G}^{\mathrm{grad}}_{\eta,\alpha}$ described above is
summarized in Table 10. (It turned out that the performance of the algorithms did not
depend much on whether the initial weight allocation was fair or uniform and we
report only the results obtained by the latter in the sequel.) The comparison points
are given by the best compound experts studied in Table 8, which exhibited an
excellent performance. This is why we expected and actually see a significant gain
of performance for the aggregation rules when resorting to forecasters tracking the
performance of the compound experts. Table 10 shows a relative improvement in
the performance of about 5 % with respect to the results of Table 9.
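
As a quick check of this order of magnitude, one may compare the smallest rmse reported in Table 9 (0.629) with the smallest one reported in Table 10 (0.598); the choice of these two entries as the reference values is an assumption made here:

$$\frac{0.629 - 0.598}{0.629} \approx 0.049,$$

that is, a relative improvement of about 5 %.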



5.5 Results obtained with a fully online tuning of the parameters 

In Sections 2.4 and 4.3 we indicated that our simulations showed that the step of
the grid was not too crucial a parameter and that the results were not too sensitive to
it; we however did not clarify how to choose the maximal (and also the minimal)
possible value(s) of $\eta$ in the considered grids, i.e., how to determine the right
scaling for $\eta$. The procedure is based on the observation that in Figures 6 and 7
of Section 4.3 the selected parameters $\eta_t$ are eventually constant or vary in a
small range. It thus simply suffices to ensure that the constructed grid covers a
large enough span. This can be implemented by extending online the considered
grid as follows. We let the user fix an arbitrary finite starting grid, say, reduced
to {1}. At any time t when the selected parameter $\eta_{t-1}$ is an endpoint of the
grid, we enlarge it by adding the values $2^r \eta_{t-1}$, for $r \in \{1, 2, 3\}$ if the endpoint
is the upper limit of the grid and for $r \in \{-1, -2, -3\}$ if it is the lower limit.
(We tested different factors than the factor of 2 considered here
and also tried to increase the grid with more than three points; no such change
had an important impact on the performance.) The possible choices for $\alpha$ are in
the (known) bounded range [0, 1] and therefore no scaling issue takes place. We
considered a fixed grid of possible values of $\alpha$ given by

$$\alpha \in \{0, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5, 1\}.$$
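
The online extension of the grid can be sketched as follows. This Python sketch is illustrative and its function name and defaults are assumptions; it follows the description above, and when the grid is reduced to a single point (which is then both its upper and its lower limit) it extends the grid on both sides.

```python
def extend_grid(grid, selected, factor=2.0, n_new=3):
    """Enlarge the grid when the last selected parameter sits at one of its endpoints.

    grid: current finite grid of positive values of the learning rate.
    selected: value chosen by the meta-rule at the previous round.
    Returns the (possibly enlarged) grid as a sorted list.
    """
    ordered = sorted(set(grid))
    new_points = []
    if selected == ordered[-1]:
        # Upper endpoint reached: add factor^r * selected for r = 1, ..., n_new.
        new_points += [selected * factor ** r for r in range(1, n_new + 1)]
    if selected == ordered[0]:
        # Lower endpoint reached: add factor^(-r) * selected for r = 1, ..., n_new.
        new_points += [selected * factor ** (-r) for r in range(1, n_new + 1)]
    return sorted(set(ordered) | set(new_points))


# Starting from the grid {1}, as suggested in the text.
print(extend_grid([1.0], 1.0))
```

Each newly added value requires starting a fresh instance of the base rule (or replaying it on the past data), which is the price paid for not fixing the scale of the grid beforehand.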

The performance of this adaptive construction of the grids used by the
meta-rules with respect to the best constant choices in hindsight is summarized in
Tables 11 and 12. We observe that the now fully sequential character of the meta-rule
comes at a limited cost in the performance. (That cost would be almost
insignificant if a training period was allowed, so as to start the evaluation period with a
grid already large enough.)



5.6 Robustness study of the considered aggregation rules 

In this section we move from the study of global average behaviors of the
aggregation rules (as measured by their rmse) to a more individual analysis, based on the
scattering of the prediction residuals $y_t - \hat{y}_t$. The rmse is indeed a global criterion

Table 9 Performance obtained by the sequential aggregation rules $\mathcal{W}_\eta$, $\mathcal{W}^{\mathrm{grad}}_\eta$, $\mathcal{T}_\eta$, and $\mathcal{T}^{\mathrm{grad}}_\eta$
for various choices of $\eta$; the smallest RMSE obtained for each rule is marked with an asterisk.

Values of $\eta$                                       Prior     $10^{-6}$   $10^{-5}$   $2 \times 10^{-4}$   $10^{-3}$   $2 \times 10^{-2}$   $10^{-1}$   2
RMSE of $\mathcal{W}_\eta$                             (unf.)    0.724       0.722       0.718*               0.731       0.784                0.783       0.784
RMSE of $\mathcal{W}_\eta$                             (fair)    0.736       0.731       0.684*               0.722       0.785                0.784       0.785
RMSE of $\mathcal{W}^{\mathrm{grad}}_\eta$             (unf.)    0.724       0.722       0.705                0.683       0.631                0.640       0.629*
RMSE of $\mathcal{W}^{\mathrm{grad}}_\eta$             (fair)    0.737       0.733       0.697                0.674       0.633*               0.641       0.640
RMSE of $\mathcal{T}_\eta$                             (unf.)    0.724       0.722       0.718*               0.731       0.785                0.783       0.752
RMSE of $\mathcal{T}_\eta$                             (fair)    0.736       0.731       0.684*               0.721       0.786                0.784       0.753
RMSE of $\mathcal{T}^{\mathrm{grad}}_\eta$             (unf.)    0.724       0.712       0.705                0.683       0.631*               0.640       0.741
RMSE of $\mathcal{T}^{\mathrm{grad}}_\eta$             (fair)    0.737       0.733       0.697                0.674       0.633*               0.641       0.855

Table 10 Performance obtained by the sequential aggregation rules $\mathcal{G}_{\eta,\alpha}$ and $\mathcal{G}^{\mathrm{grad}}_{\eta,\alpha}$ run with
an initial uniform allocation of the weights for various choices of $\eta$ and $\alpha$; the smallest rmse
obtained for each rule is marked with an asterisk.

Values of $\eta$                                      0.01    0.01    0.01    1       1        1       500     500     500
Values of $\alpha$                                    0.001   0.01    0.05    0.001   0.01     0.05    0.001   0.01    0.05
RMSE of $\mathcal{G}_{\eta,\alpha}$                   0.678   0.683   0.704   0.711   0.659    0.652   0.674   0.633   0.632*
RMSE of $\mathcal{G}^{\mathrm{grad}}_{\eta,\alpha}$   0.646   0.669   0.700   0.622   0.598*   0.637   0.683   0.675   0.671

Table 11 Performance obtained by the rules $\mathcal{W}_\eta$, $\mathcal{W}^{\mathrm{grad}}_\eta$, $\mathcal{T}_\eta$, and $\mathcal{T}^{\mathrm{grad}}_\eta$ for the best constant
choice of $\eta$ in hindsight and when used as keystones of a meta-rule selecting sequentially the
values of $\eta$ based on an adaptive grid; results are reported for both the uniform and fair priors.

                                             Uniform prior                           Fair prior
                                             Best constant $\eta$   Adaptive grid    Best constant $\eta$   Adaptive grid
RMSE of $\mathcal{W}_\eta$                   0.718                  0.724            0.684                  0.696
RMSE of $\mathcal{W}^{\mathrm{grad}}_\eta$   0.629                  0.640            0.633                  0.644
RMSE of $\mathcal{T}_\eta$                   0.718                  0.723            0.684                  0.698
RMSE of $\mathcal{T}^{\mathrm{grad}}_\eta$   0.631                  0.640            0.633                  0.645

Table 12 Performance obtained by the rules $\mathcal{G}_{\eta,\alpha}$ and $\mathcal{G}^{\mathrm{grad}}_{\eta,\alpha}$ run with an initial uniform
weight allocation for the best constant choices of $\eta$ and $\alpha$ in hindsight (left) and when used as
keystones of a meta-rule selecting sequentially the values of $\eta$ based on an adaptive grid and
the values of $\alpha$ according to a fixed grid (right).

                                                      Best constant pair $(\eta, \alpha)$   Adaptive grid
RMSE of $\mathcal{G}_{\eta,\alpha}$                   0.632                                 0.658
RMSE of $\mathcal{G}^{\mathrm{grad}}_{\eta,\alpha}$   0.598                                 0.623

Fig. 10 Half-hourly RMSE of the meta-rules based on the rules $\mathcal{W}^{\mathrm{grad}}_\eta$ (symbol: □) and $\mathcal{G}^{\mathrm{grad}}_{\eta,\alpha}$
(symbol: •), as well as the ones of the best overall single expert (solid line) and of the best
overall convex weight vector (dashed line). Horizontal axis: half hours.

Fig. 11 Using the same rules and benchmarks as in Figure 10, with the same legend: 50%
(black), 75 % (grey), and 90 % (black) quantiles of the absolute values of the residuals, grouped
per half hours. Horizontal axis: half hours.

and we want to check that the overall good performance does not come at the
cost of local disasters in the accuracy of the aggregated forecasts. To that end we
split the data set by the half hours into 48 sub-data sets; for each of these subsets
we compute the rmse of some of the benchmarks and aggregation rules discussed
above and study also the scattering of the (absolute values of the) prediction
residuals. To do so we consider two fully sequential aggregation rules, namely, the
meta-rules based on families of $\mathcal{W}^{\mathrm{grad}}_\eta$ and $\mathcal{G}^{\mathrm{grad}}_{\eta,\alpha}$ run with initial uniform weight
allocations. We use as benchmarks the (overall) best single expert and the (overall)
best convex weight vector, whose performance was reported in Table 8.
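
The per-half-hour diagnostics can be computed along the following lines. This Python sketch is illustrative only: the function name is hypothetical, it assumes that the series consists of complete days of 48 consecutive instances (so that the index modulo 48 identifies the half hour), and it uses a simple nearest-rank empirical quantile, which may differ from the quantile definition used for the figures.

```python
import math


def half_hourly_summary(predictions, observations, period=48, levels=(0.5, 0.75, 0.9)):
    """RMSE and empirical quantiles of the absolute residuals, per position in the day.

    Returns a list of length `period`; entry h is a pair (rmse, quantiles) computed
    on the instances t with t % period == h. Each group is assumed to be nonempty.
    """
    groups = [[] for _ in range(period)]
    for t, (prediction, observation) in enumerate(zip(predictions, observations)):
        groups[t % period].append(abs(observation - prediction))

    summary = []
    for errors in groups:
        errors.sort()
        rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
        # Nearest-rank quantile: smallest value with at least level * n points below or at it.
        quantiles = [errors[max(0, math.ceil(level * len(errors)) - 1)] for level in levels]
        summary.append((rmse, quantiles))
    return summary
```

Comparing these summaries between an aggregation rule and a benchmark, half hour by half hour, shows whether a good overall rmse hides a degradation at specific times of the day.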

Figure 10 plots the half-hourly RMSE of these two aggregation rules and of these
two benchmarks. It shows that the performance of the rule based on exponentially
weighted averages is, uniformly over the 48 elements of the partition of days into
half hours, at least as good as that of the best constant convex combination
of the expert forecasts. The performance of the rule based on fixed-share aggregation
rules is intriguing: its accuracy is significantly improved with respect to
that of the latter benchmark between 12:00 and 21:00 but is also slightly worse
between 6:00 and 12:00. It thus seems that this rule has excellent performance at
very short-term horizons and would probably strongly benefit from an intermediate
update around midnight (this is however not the purpose of the present study:
intra-day forecasting is left for future research). A similar behavior is observed in
Figure 11, which depicts the medians, the third quartiles, and the 90% quantiles
of the absolute values of the residuals grouped by half hours. In addition, we see
that the distributions of the errors of the aggregation rules are more concentrated
than those of the best benchmarks, which indicates that their good overall performance
does not come at the cost of some local disasters in the quality of the
predictions.
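The per-half-hour diagnostics described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the study: the residuals are assumed to be given as a time-ordered list starting at the first half hour of a day and covering a whole number of days, and the names `empirical_quantile` and `half_hourly_diagnostics` are hypothetical. The quantile convention (linear interpolation between order statistics) is also an assumption.

```python
import math


def empirical_quantile(values, q):
    """Quantile of a non-empty list, interpolating linearly between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def half_hourly_diagnostics(residuals, slots_per_day=48):
    """Per-half-hour RMSE and quantiles of the absolute residuals.

    `residuals` is a list of forecast errors, ordered in time, assumed to
    start at the first half hour of a day and to cover whole days only.
    """
    if not residuals or len(residuals) % slots_per_day != 0:
        raise ValueError("expected a positive whole number of days")
    summary = []
    for slot in range(slots_per_day):
        group = residuals[slot::slots_per_day]  # same half hour, all days
        abs_group = [abs(r) for r in group]
        summary.append({
            "rmse": math.sqrt(sum(r * r for r in group) / len(group)),
            "q50": empirical_quantile(abs_group, 0.50),  # median
            "q75": empirical_quantile(abs_group, 0.75),  # third quartile
            "q90": empirical_quantile(abs_group, 0.90),
        })
    return summary
```

Comparing the returned lists for an aggregation rule and for a benchmark, slot by slot, gives the kind of comparison plotted in Figures 10 and 11.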

All in all, we conclude that the best aggregation rules never encounter large
prediction errors in comparison to the best expert or to the best convex combination
of experts and often encounter much smaller such errors. This strongly favors
their use in an industrial context where large errors can be highly detrimental
(potential issues range from financial penalties to blackouts). In a nutshell,
aggregation rules are seen to reduce the risk of prediction, which is an important
advantage for operational forecasting.



6 Conclusions 



On the theoretical side, we reviewed and extended known aggregation rules for
the case of specialized (sleeping) experts. First, we provided a general analysis of
the specialist aggregation rules of Freund et al. [1997] for all convex loss functions,
while the original reference needed an ad hoc analysis for each loss function of
interest. Second, we showed how the fixed-share rules of Herbster and Warmuth
[1998] can accommodate specialized experts: they form a natural and efficient
alternative to the specialist aggregation rules. Finally, for all these rules, as well
as the exponentially weighted average ones, we indicated how to extend them so
as to take into account the operational constraint of outputting simultaneous
forecasts for a fixed number of future time instances.

We then followed a general methodology to study the performance of these rules
on real data of electricity consumption. In particular, we provided fully adaptive
methods that can tune their parameters online based on adaptive grids; in doing so,
they clearly outperform the rules tuned with the theoretically optimal parameters.
All in all, for the two data sets at hand, the best rules, given by fixed-share type
rules, improve on the accuracy of the best constant convex combination of the
experts by about 5% (Slovakian data set) to about 15% (French data set). In
addition, we noted that resorting to the gradient trick described in Section 2.2
always improved the performance of the underlying aggregation rule. Finally, the
raw improvement in global performance, as measured by the RMSE,
of the sequential aggregation rules over the (convex combinations of) experts also
comes together with a reduction of the risk of large errors: the studied aggregation
rules are more robust than the base forecasters they are using.
A Proof of Theorem 2

Proof One can show by induction that the vectors $w_t$ are convex weight vectors. We use the
notation defined in Section 2.2 for the normalization of convex weight vectors to a given
set of active experts $E$; then, the convex combination used by $\mathcal{S}_\eta$ at round $t$ can be written as
$p_t$, the normalization of $w_t$ to the set $E_t$ of the experts that are active at round $t$.
By convexity of the loss functions $\ell_t$, the regret with respect to some expert $j$ can be
bounded as

$$R_T(\mathcal{S}_\eta, j) \le \sum_{t=1}^{T} \Bigl( \sum_{i \in E_t} p_{i,t}\, \ell_t(\delta_i) - \ell_t(\delta_j) \Bigr) 1_{\{j \in E_t\}} .$$

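Hoeffding's lemma (see, e.g., [Cesa-Bianchi and Lugosi, 2006, Lemma A.1]) entails that for all
$t$ such that $j \in E_t$,

$$\sum_{i \in E_t} p_{i,t}\, \ell_t(\delta_i) \le -\frac{1}{\eta} \ln \Bigl( \sum_{i \in E_t} p_{i,t}\, e^{-\eta \ell_t(\delta_i)} \Bigr) + \frac{\eta}{8} B^2$$

$$= -\frac{1}{\eta} \ln \frac{w_{j,t}\, e^{-\eta \ell_t(\delta_j)}}{w_{j,t+1}} + \frac{\eta}{8} B^2 = \ell_t(\delta_j) - \frac{1}{\eta} \ln \frac{w_{j,t}}{w_{j,t+1}} + \frac{\eta}{8} B^2 ,$$

where we used that the update of the weight of an expert $j \in E_t$ can be rewritten by definition
as

$$w_{j,t+1} = w_{j,t}\, e^{-\eta \ell_t(\delta_j)} \Big/ \sum_{i \in E_t} p_{i,t}\, e^{-\eta \ell_t(\delta_i)} .$$

For $j \notin E_t$, we have that $w_{j,t+1} = w_{j,t}$, again by definition of the rule. Thus a telescoping
sum appears and we get

$$R_T(\mathcal{S}_\eta, j) \le -\frac{1}{\eta} \ln \frac{w_{j,1}}{w_{j,T+1}} + \frac{\eta}{8} B^2 \sum_{t=1}^{T} 1_{\{j \in E_t\}} .$$

The proof is concluded by noting that $w_{j,1}/w_{j,T+1} \ge 1/N$ as $w_{j,1} = 1/N$ and $w_{j,T+1} \le 1$.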
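The argument above can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the update in `specialist_round` is the one read off the proof (active experts are reweighted so that the total weight of the active set is preserved, sleeping experts keep their weight), the losses are assumed to lie in [0, `loss_bound`], the prediction loss is replaced by the linearized quantity that the proof actually bounds, and the function names and the random data are hypothetical.

```python
import math
import random


def specialist_round(w, active, losses, eta):
    """Return (p, w_next) for one round of a specialist-style update.

    w      : list of nonnegative weights summing to 1
    active : non-empty set of indices of the experts active at this round
    losses : dict mapping each active index to its loss
    """
    total = sum(w[i] for i in active)
    p = {i: w[i] / total for i in active}  # normalization to the active set
    mix = sum(p[i] * math.exp(-eta * losses[i]) for i in active)
    w_next = list(w)  # sleeping experts keep their weight
    for i in active:
        w_next[i] = w[i] * math.exp(-eta * losses[i]) / mix
    return p, w_next


def simulate(n_experts=4, n_rounds=200, eta=0.3, loss_bound=1.0, seed=0):
    """Run the update on random data; return regrets, bounds, final weights."""
    rng = random.Random(seed)
    w = [1.0 / n_experts] * n_experts
    regret = [0.0] * n_experts
    n_active = [0] * n_experts
    for _ in range(n_rounds):
        active = {i for i in range(n_experts) if rng.random() < 0.7}
        if not active:
            active = {rng.randrange(n_experts)}
        losses = {i: loss_bound * rng.random() for i in active}
        p, w_next = specialist_round(w, active, losses, eta)
        mixture_loss = sum(p[i] * losses[i] for i in active)  # linearized loss
        for j in active:
            regret[j] += mixture_loss - losses[j]
            n_active[j] += 1
        w = w_next
    # ln(N)/eta plus eta * B^2 / 8 per round in which the expert was active
    bounds = [math.log(n_experts) / eta + eta * loss_bound ** 2 / 8 * c
              for c in n_active]
    return regret, bounds, w
```

Calling `simulate()` returns, for each expert, the cumulated regret over the rounds where it was active together with the corresponding bound, so that the two lists can be compared entry by entry.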



B Proof of Theorem 3

The following proof is a straightforward adaptation of the techniques presented in
[Cesa-Bianchi and Lugosi, 2006, Section 5.2]. Its only merit is to show how the share update was
obtained in Figure 2.

Proof We first note that by convexity of the $\ell_t$,



T 



max iJT(J-^,c, jT) ^ E E P^,ttt{S.) ~ ^'fe) • (12) 

We now use the same proof scheme as in [Cesa-Bianchi and Lugosi, 2006, Section 5.2] and
show that the rule $\mathcal{F}_{\eta,\alpha}$ is simply an efficient implementation of the rule that would, at each
round $t$, choose a convex weight vector with components proportional to



ifj^^^t, 



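where $\nu$ is some prior probability distribution over $\mathcal{L}$, to be defined below. It then follows from
[Cesa-Bianchi and Lugosi, 2006, Lemma 5.1] that for all $j_1^T \in \mathcal{L}$,

$$\sum_{t=1}^{T} \Bigl( \sum_{i \in E_t} p'_{i,t}\, \ell_t(\delta_i) - \ell_t(\delta_{j_t}) \Bigr) \le \frac{1}{\eta} \ln \frac{1}{\nu(j_1^T)} + \frac{\eta}{8} B^2 T . \qquad (13)$$

To get the stated bound, we thus need, on the one hand, to define the distribution $\nu$, and
on the other hand, to show that $\mathcal{F}_{\eta,\alpha}$ indeed performs the efficient implementation indicated
above.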
[First part: Definition of $\nu$] In the sequel we denote by $|E|$ the cardinality of a subset $E$ of
$\{1, \dots, N\}$. We fix a real number $\alpha \in [0, 1]$ and consider the following probability distribution
$\nu$ over the sequences of (legal and illegal) experts, i.e., over $\{1, \dots, N\}^T$. For each element
$j_1^T \in \mathcal{L}$, we denote by $m$ its size, by $t_1, \dots, t_m$ the instances $1 \le t \le T-1$ such that $j_t \ne j_{t+1}$,
and by $\mathcal{T}$ the set of instances $1 \le t \le T-1$ such that $j_t = j_{t+1}$; we then set



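$$\nu(j_1^T) = \frac{1}{|E_1|} \Bigl( \prod_{t \in \mathcal{T}} \Bigl( 1 - \alpha + \frac{\alpha}{|E_{t+1}|} \Bigr) \Bigr) \prod_{k=1}^{m} \Bigl( \frac{\alpha}{|E_{t_k+1}|}\, 1_{\{j_{t_k} \in E_{t_k+1}\}} + \frac{1}{|E_{t_k+1}|}\, 1_{\{j_{t_k} \notin E_{t_k+1}\}} \Bigr) ;$$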



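for $j_1^T \notin \mathcal{L}$, we set $\nu(j_1^T) = 0$. This application $\nu$ indeed defines a probability distribution, as
can be seen by introducing the uniform distribution $\mu_1$ over $E_1$ and the following transition
functions $\mathrm{Tr}_t : \{1, \dots, N\}^2 \to [0, 1]$; for all $i, j$,

$$\mathrm{Tr}_t(i \to j) = \left\{ \begin{array}{lll} 0 & \text{if } j \notin E_{t+1}; & (14) \\ (1 - \alpha) + \alpha/|E_{t+1}| & \text{if } i \in E_{t+1} \text{ and } i = j; & (15) \\ \alpha/|E_{t+1}| & \text{if } j \in E_{t+1},\ i \in E_{t+1}, \text{ and } i \ne j; & (16) \\ 1/|E_{t+1}| & \text{if } j \in E_{t+1} \text{ and } i \notin E_{t+1}. & (17) \end{array} \right.$$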

Its interpretation is as follows. We never switch to an inactive expert, as is ensured by (14).
If we can stay on the same expert (if the current expert remains active), then we do so with
a probability slightly larger than $1 - \alpha$, see (15). If we could have stayed on the same expert,
then (16) indicates that we switch with probability $\alpha/|E_{t+1}|$ to any given different expert in $E_{t+1}$.
Finally, (17) controls the case when the current expert becomes inactive and we need to switch
to a new expert for the compound expert to be legal.

Now, we note that for all $i$ and $t$, by distinguishing whether $i \in E_{t+1}$ or $i \notin E_{t+1}$,

$$\sum_{j=1}^{N} \mathrm{Tr}_t(i \to j) = 1$$

and that, for all $j_1^T \in \{1, \dots, N\}^T$ (all of them, the legal and the illegal ones),

$$\nu(j_1^T) = \mu_1(j_1) \prod_{t=1}^{T-1} \mathrm{Tr}_t(j_t \to j_{t+1}) . \qquad (18)$$
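The two facts just stated (each transition function sums to one over the destination expert, and the product formula (18) defines a probability distribution that vanishes on illegal sequences) can be verified exactly on a small instance. The sketch below is only an illustration: experts are indexed from 0, the sets of active experts are passed as a list `active_sets` whose entry `t` plays the role of $E_{t+1}$, `alpha` is assumed to be a `fractions.Fraction` so that the arithmetic is exact, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def transition(i, j, next_active, alpha):
    """Probability of moving from expert i to expert j, following cases (14)-(17)."""
    size = len(next_active)
    if j not in next_active:
        return Fraction(0)                 # (14): never move to an inactive expert
    if i not in next_active:
        return Fraction(1, size)           # (17): forced switch, uniform over active experts
    if i == j:
        return (1 - alpha) + alpha / size  # (15): stay on the same expert
    return alpha / size                    # (16): voluntary switch

def path_probability(path, active_sets, alpha):
    """Probability of a sequence of experts under the product formula (18)."""
    first = active_sets[0]
    # uniform initial distribution over the first set of active experts
    prob = Fraction(1, len(first)) if path[0] in first else Fraction(0)
    for t in range(len(path) - 1):
        prob *= transition(path[t], path[t + 1], active_sets[t + 1], alpha)
    return prob
```

For instance, with three experts, `alpha = Fraction(1, 4)` and `active_sets = [{0, 1}, {1, 2}, {0, 1, 2}]`, summing `path_probability` over `product(range(3), repeat=3)` enumerates all 27 sequences, legal or not.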

To prove the stated bound, assuming we have proven as well that $p_t = p'_t$ for all $t$ (which
we do below, in the second part of the proof), it suffices to combine (12) and (13) with the
following immediate lower bound on the $\nu(j_1^T)$,

$$\nu(j_1^T) \ge \frac{1}{N} \Bigl( \prod_{t \in \mathcal{T}} (1 - \alpha) \Bigr) \Bigl( \prod_{k=1}^{m} \frac{\alpha}{N} \Bigr) = \frac{1}{N} (1 - \alpha)^{T-m-1} \Bigl( \frac{\alpha}{N} \Bigr)^{m} ,$$

which we obtained by upper bounding all cardinalities $|E_t|$ by $N$ in the definition of $\nu$ and by
using $0 \le \alpha \le 1$. (The obtained bound is actually exactly the one of [Cesa-Bianchi and Lugosi,
2006, Theorem 5.2], due to the loose way we lower bounded $\nu$.)

[Second part: Proof of the efficient implementation] The proof goes by induction and
mimics exactly the one of [Cesa-Bianchi and Lugosi, 2006, Theorem 5.1]. It suffices to show that
for all $j \in \{1, \dots, N\}$ and $t \in \{0, \dots, T-1\}$, one has $w_{j,t} = w'_{j,t}$. We first note that
thanks to (18), the distribution $\nu$ can be interpreted as the distribution of an inhomogeneous
Markov process, hence (18) indicates the distribution that $\nu$ induces over $\{1, \dots, N\}^s$, for all
$1 \le s \le T$; the latter is given by simply replacing $T$ by $s$ in (18). We can therefore rewrite $w'_{j,t}$
as

$$w'_{j,t} = \sum_{j_1, \dots, j_{t+1}} \nu\bigl(j_1^{t+1}\bigr)\, e^{-\eta \sum_{s=1}^{t} \ell_s(\delta_{j_s})}\, 1_{\{j_{t+1} = j\}} , \qquad (19)$$

where the first sum is (indifferently) taken over $\{1, \dots, N\}^{t+1}$ or $E_1 \times \dots \times E_{t+1}$. For $t = 0$,
we get

$$w'_{j,0} = \sum_{j_1=1}^{N} \nu(j_1)\, 1_{\{j_1 = j\}} = \mu_1(j) = w_{j,0}$$

by definition of $\nu$ and of the $w_{j,0}$ (we recall that $\mu_1$ denotes the uniform distribution over $E_1$).
Now, we assume that for some $t \ge 1$, we have proved that $w_{i,t-1} = w'_{i,t-1}$ for all $i \in \{1, \dots, N\}$.
For $j \in E_{t+1}$, by the share update in Figure 2 and by the induction hypothesis,



+ (l-a)n{,gB,nE,+i}^,t-ie-''^'<*^'. 
By definition of the transition functions (14)-(17), this equality can be rewritten as



Substituting (19) in this equality, we get



j\,...,it ieEt 



w'- , = 0. This concludes this proof. 



C Proof of Corollary 2

This proof uses the same methodology as the one of Corollary 1.

Proof We fix a compound weight vector $q_1^T \in \mathcal{L}_m$ and denote by $\mathcal{L}(q_1^T) \subset \mathcal{L}_m$ the set of
compound experts that are compatible with $q_1^T$ in the following sense: denoting by $t_1, \dots, t_m$
the time instances $1 \le s \le T-1$ such that $q_s \ne q_{s+1}$, the elements $j_1^T$ in $\mathcal{L}(q_1^T)$ are
characterized by the fact that $j_s \ne j_{s+1}$ only if $s = t_k$ for some $k \in \{1, \dots, m\}$. We insist on the
fact that this is an "only if" statement and not an "if and only if" statement; this means that
the switches in the sequences $j_1^T \in \mathcal{L}(q_1^T)$ can only occur (but are not bound to occur) at the
indexes of the switches in $q_1^T$.

Now, we recall that by the gradient trick recalled in Section 2.2,

T 

Since the it are linear over X, the last expression can be upper bounded by 
T T 
J2(^tiPt) - liQt)) ^ max E(^*(Pt) - ^'fe)) ' 

which shows that in particular, 

T T 

^(£t(Pi)- Jt(qt)) max E (^'(P') " ^* )) = iJr jf ) . 

t=i H s-C™ ii e£„i 

The proof is concluded by noting that Theorem 3 exactly ensures that the rule is such

that 

max RT(^,^:S',in ^ ^^InJV+lln ^ + I'^'^^fT . 
D Sketch of a regret bound on the operational adaptation $\mathcal{W}_\eta$ of $\mathcal{E}_\eta$

We provide a proof by approximation and show that the regret of $\mathcal{W}_\eta$ is bounded by the regret
of $\mathcal{E}_\eta$ plus some small term. To do so, we compare the definitions of the weights of the two
rules (see (2)), e.g., in the case when $p_{j,0} = 1/24$ for all experts $j$.

Since $R_{48\lfloor (t-1)/48 \rfloor}(\mathcal{E}_\eta, j)$ and $R_{t-1}(\mathcal{E}_\eta, j)$ differ by at most 47 instantaneous regrets, each
of which is bounded between $-B^2$ and $B^2$, the ratio between the numerators in the two definitions, as
well as the one between their denominators, lies in the interval $[e^{-47 \eta B^2}, e^{47 \eta B^2}]$. Therefore,
the ratios of the weights of the two rules are in the interval $[e^{-94 \eta B^2}, e^{94 \eta B^2}]$. Then,
using a gradient bound, the difference between the regrets of interest can be bounded as

$$R_T(\mathcal{W}_\eta, j) - R_T(\mathcal{E}_\eta, j) \le 2 B^2 \max\bigl\{ e^{94 \eta B^2} - 1,\ 1 - e^{-94 \eta B^2} \bigr\}\, T ,$$

which, for $\eta$ small enough, is of the order of $B^4 \eta T$. Taking $\eta$ of the order of $1/\sqrt{T}$, which
is also the optimal order of magnitude for the bound on $R_T(\mathcal{E}_\eta, j)$ stated in Theorem 1, entails
that $R_T(\mathcal{W}_\eta, j) = O(\sqrt{T}) = o(T)$, as asserted above.



